pydantic python-polars

How to serialize a polars dataframe type in a pydantic v2 basemodel?


I have a pydantic (v2) BaseModel that can take a polars DataFrame as one of its model fields. I wish to be able to serialize the dataframe. Preferably, I would be able to serialize AND de-serialize it, but I would be happy with just being able to serialize it.

The polars dataframe has a df.write_json() method. My thinking has been that I would take the JSON output from that method and read it back in via the python json library, so that it becomes a JSON-serializable dict. Then I would somehow attach this "encoder" to the pydantic JSON method. For deserialization, I would use pl.read_json() to produce a dataframe.

Unfortunately, in the pydantic documentation, I can see how to write a custom serializer for a named field, but not for a given type.

There are some docs on serializing subclasses by introducing a __get_pydantic_core_schema__ class method, but I would prefer to avoid this approach, since I would like to be able to use the polars classes directly.

Here is an example where Foo().model_dump_json() currently raises PydanticSerializationError: Unable to serialize unknown type: <class 'polars.dataframe.frame.DataFrame'>.

from typing import Any
from pydantic import BaseModel
import polars as pl
import json

df = pl.DataFrame({"foo":[1,2,3], "bar":[4,5,6]})
df.write_json() # this produces a json representation of my dataframe
# {"columns":[{"name":"foo","datatype":"Int64","bit_settings":"","values":[1,2,3]},{"name":"bar","datatype":"Int64","bit_settings":"","values":[4,5,6]}]}

# I could use pl.read_json() to read it back into a dataframe.

def json_serializable_dataframe(df: pl.DataFrame) -> dict[str, Any]:
    "Load serialized dataframe into a serializable dict."
    return json.loads(df.write_json())

class Foo(BaseModel, arbitrary_types_allowed=True):
    df: pl.DataFrame = pl.DataFrame({"foo":[1,2,3], "bar":[4,5,6]})


Foo().model_dump_json() # how to incorporate my json_serializable_dataframe encoder here?

Is there a way to give pydantic the ability to serialize a custom type?


Solution

  • Can you use @model_serializer and manually look for DataFrames?

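    import polars as pl
    from pydantic import BaseModel, model_serializer

    class Foo(BaseModel, arbitrary_types_allowed=True):
        a: pl.DataFrame = pl.DataFrame({"foo":[1], "bar":[2]})
        b: pl.DataFrame = pl.DataFrame({"baz":[3], "omg":[4]})

        @model_serializer
        def serialize(self):
            # Build a new dict rather than mutating self.__dict__, so the
            # model still holds DataFrames after it has been serialized.
            # Note: on recent polars versions serialize() may default to a
            # binary format; format="json" may be needed there.
            return {
                name: obj.lazy().serialize() if isinstance(obj, pl.DataFrame) else obj
                for name, obj in self.__dict__.items()
            }

    Foo().model_dump_json()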
    
    '{"a":"{\\"DataFrameScan\\":{\\"df\\":{\\"columns\\":[{\\"name\\":\\"foo\\"...
    

    note: Polars offers frame (de-)serialization via: