
Automatically merging multiple Pydantic models with overlapping fields


It is kind of difficult to accurately phrase my question in one sentence.

I have the following models:

from pydantic import BaseModel


class Detail1(BaseModel):
    round: bool
    volume: float


class AppleData1(BaseModel):
    origin: str
    detail: Detail1


class Detail2(BaseModel):
    round: bool
    weight: float


class AppleData2(BaseModel):
    origin: str
    detail: Detail2

Here AppleData1 has an attribute detail which is of the type Detail1. AppleData2 has an attribute detail which is of the type Detail2. I want to make an Apple class which contains all the attributes of AppleData1 and AppleData2.

Question (How to implement the algorithm?)

Is there a generic approach to implementing the following algorithm?

  • Whenever AppleData1 and AppleData2 have an attribute of the same name:

    • If they are of the same type, use one of them. For example, AppleData1.origin and AppleData2.origin are both of the type str. So Apple.origin is also of type str.

    • If they are of different types, merge them. For example, AppleData1.detail and AppleData2.detail are of type Detail1 and Detail2 respectively, so Apple.detail should contain all the inner attributes of both.

  • Any common inner attribute is always for the same physical quantity. So overwriting is allowed. For example, Detail1.round and Detail2.round are both of type bool. So the resulting Apple.detail.round is also of type bool.
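The rules above can be sketched without Pydantic, using plain annotated classes. This is only a minimal illustration of the merge logic under stated assumptions: `merge_classes` and `is_record` are hypothetical helper names, and "nested record" is taken to mean any class that carries annotations. It does no validation and is not a Pydantic implementation.

```python
from typing import get_type_hints


def is_record(tp: object) -> bool:
    """Hypothetical check: treat any class with annotated attributes as a nested record."""
    return isinstance(tp, type) and bool(get_type_hints(tp))


def merge_classes(name: str, cls1: type, cls2: type) -> type:
    """Return a new plain class whose annotations combine those of cls1 and cls2."""
    merged = dict(get_type_hints(cls1))
    for attr, tp2 in get_type_hints(cls2).items():
        if attr not in merged or merged[attr] == tp2:
            # Attribute only in cls2, or same type in both: take it as-is.
            merged[attr] = tp2
            continue
        tp1 = merged[attr]
        if is_record(tp1) and is_record(tp2):
            # Different nested record types: merge their attributes recursively.
            merged[attr] = merge_classes(f"{tp1.__name__}{tp2.__name__}", tp1, tp2)
        else:
            raise TypeError(f"{attr!r}: cannot merge {tp1!r} and {tp2!r}")
    combined = type(name, (), {})
    combined.__annotations__ = merged
    return combined


# Plain-class stand-ins for the models in the question.
class Detail1:
    round: bool
    volume: float


class AppleData1:
    origin: str
    detail: Detail1


class Detail2:
    round: bool
    weight: float


class AppleData2:
    origin: str
    detail: Detail2


Apple = merge_classes("Apple", AppleData1, AppleData2)
print(get_type_hints(get_type_hints(Apple)["detail"]))
```

Here `Apple.detail` is annotated with a freshly created class that carries `round`, `volume`, and `weight`, while `origin` stays `str`.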

Expected Results

The end results should be equivalent to the Apple model below. (The definition of Detail class below is only used to make the code below complete. The generic approach should not hard-code the Detail class.)

class Detail(BaseModel):
    round: bool
    volume: float
    weight: float

class Apple(BaseModel):
    origin: str
    detail: Detail

My Solution (bad example)

class Detail(Detail1, Detail2):
    pass


class Apple(AppleData1, AppleData2):
    origin: str
    detail: Detail

print(Apple.schema_json())

This solution works, but it is too specific:

  1. I have to single out the detail attribute shared by AppleData1 and AppleData2 and explicitly create the Detail class from Detail1 and Detail2.

  2. I have to single out origin as a common attribute of the same type (str), so I hard-coded origin: str in the definition of the Apple class.


Solution

  • Simplified solution

    A custom recursive wrapper around the create_model function that dynamically constructs a "combined" model class should work. The code below is written against the Pydantic v1 API (__fields__, SHAPE_SINGLETON, schema_json):

    from typing import TypeGuard, TypeVar
    from pydantic import BaseModel, create_model
    from pydantic.fields import SHAPE_SINGLETON
    
    M = TypeVar("M", bound=BaseModel)
    
    
    def is_pydantic_model(obj: object) -> TypeGuard[type[BaseModel]]:
        return isinstance(obj, type) and issubclass(obj, BaseModel)
    
    
    def create_combined_model(
        __name__: str,
        /,
        model1: type[M],
        model2: type[M],
    ) -> type[M]:
        field_overrides = {}
        for name, field1 in model1.__fields__.items():
            field2 = model2.__fields__.get(name)
            if field2 is None:
                # Not a shared field; plain inheritance handles it.
                continue
            if is_pydantic_model(field1.type_):
                # Shared nested model: recursively combine the two sub-models.
                assert field1.shape == SHAPE_SINGLETON, "No model collections allowed"
                assert is_pydantic_model(field2.type_), f"{name} with different types"
                sub_model = create_combined_model(
                    f"Combined{field1.type_.__name__}{field2.type_.__name__}",
                    field1.type_,
                    field2.type_,
                )
                field_overrides[name] = (sub_model, field1.field_info)
            else:
                # Shared non-model field: both annotations must match exactly.
                assert field1.annotation == field2.annotation, f"{name} with different types"
        return create_model(__name__, __base__=(model1, model2), **field_overrides)  # type: ignore
    

    This incorporates the restrictions and assumptions about combinable models that you described in your comments.

    It does not support combining fields that are annotated with C[M], where C is any generic collection type and M is a subclass of BaseModel. That is what the SHAPE_SINGLETON check ensures. It would be possible to add logic that combines the models while retaining the shape of the field (e.g. list[Detail1] and list[Detail2]), but I left that out because you did not ask for it explicitly and it is a bit more complicated.
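As a rough idea of what retaining the shape could look like, the sketch below unwraps a one-parameter container annotation with `typing.get_origin` and `typing.get_args`, merges the item types, and re-wraps the result. It uses plain classes and is not part of the solution above: `merge_in_container` and `inherit_both` are hypothetical names, and `inherit_both` merely stands in for `create_combined_model`. In Pydantic v1 the full annotation should be available as `field.outer_type_`, but wiring that in is left out here.

```python
from typing import Callable, get_args, get_origin


def merge_in_container(
    tp1: object,
    tp2: object,
    merge: Callable[[type, type], type],
) -> object:
    """Merge the item types of two annotations like list[A] and list[B]."""
    origin = get_origin(tp1)
    args1, args2 = get_args(tp1), get_args(tp2)
    if origin is None or origin is not get_origin(tp2):
        raise TypeError(f"Container types differ: {tp1!r} vs. {tp2!r}")
    if len(args1) != 1 or len(args2) != 1:
        raise TypeError("Only one-parameter containers are handled in this sketch")
    # Re-apply the same container to the merged item type.
    return origin[merge(args1[0], args2[0])]


class Detail1:
    volume: float = 0.0


class Detail2:
    weight: float = 0.0


def inherit_both(a: type, b: type) -> type:
    # Stand-in for the recursive model merge: simply inherit from both classes.
    return type(f"Combined{a.__name__}{b.__name__}", (a, b), {})


combined = merge_in_container(list[Detail1], list[Detail2], inherit_both)
item_type = get_args(combined)[0]
print(get_origin(combined).__name__, item_type.__name__)
```

Mappings such as dict[str, Detail1] have two type parameters and would need their own rule, which is one reason the full version is more involved.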

    Demo

    from pydantic import BaseModel
    
    
    class AppleBase(BaseModel):
        foo: str
    
    
    class DetailBase(BaseModel):
        round: bool
    
    
    class Detail1(DetailBase):
        volume: float
    
    
    class AppleData1(AppleBase):
        bar: int
        detail: Detail1
    
    
    class Detail2(DetailBase):
        weight: float
    
    
    class AppleData2(AppleBase):
        baz: float
        detail: Detail2
    
    
    Apple = create_combined_model("Apple", AppleData1, AppleData2)
    print(Apple.schema_json(indent=4))
    

    Output

    {
        "title": "Apple",
        "type": "object",
        "properties": {
            "foo": {
                "title": "Foo",
                "type": "string"
            },
            "baz": {
                "title": "Baz",
                "type": "number"
            },
            "detail": {
                "$ref": "#/definitions/CombinedDetail1Detail2"
            },
            "bar": {
                "title": "Bar",
                "type": "integer"
            }
        },
        "required": [
            "foo",
            "baz",
            "detail",
            "bar"
        ],
        "definitions": {
            "CombinedDetail1Detail2": {
                "title": "CombinedDetail1Detail2",
                "type": "object",
                "properties": {
                    "round": {
                        "title": "Round",
                        "type": "boolean"
                    },
                    "weight": {
                        "title": "Weight",
                        "type": "number"
                    },
                    "volume": {
                        "title": "Volume",
                        "type": "number"
                    }
                },
                "required": [
                    "round",
                    "weight",
                    "volume"
                ]
            }
        }
    }
    

    Caveats

    An obvious drawback to this solution is that because it dynamically creates the model class, it is impossible to properly convey the type of the resulting model in terms of static analysis.

    As written, the function is as generic as it can be: depending on the static type checker, the return type is inferred as either the join or the union of the two input model types model1 and model2.

    In the demo example this means that some type checkers, such as Mypy, will infer the type of Apple to be AppleBase (the join). This is of course not wrong, but it is not as specific as we might like, because it fails to account for the existence of the bar, baz, and detail attributes.

    A type checker that uses unions might instead infer the type as AppleData1 | AppleData2. (I have not tested it, but I believe Pyright does this.) This may or may not be preferable: it would at least always cover the existence of a detail attribute (albeit with yet another union type, Detail1 | Detail2), but to such a type checker it would be ambiguous whether Apple has a bar or a baz attribute.

    The ideal solution would be to define the return type as the intersection of the two model types passed into it. But unfortunately we do not have that typing construct (yet).

    All of this has no effect on the runtime behavior of the constructed class of course, but it is not ideal for IDE auto-suggestions for example.

    Consequently, your initial explicit approach of using multiple inheritance for all the models involved is still something I would recommend, unless your models become very large/complex and numerous.
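The typing caveat can be reproduced without Pydantic. The sketch below uses plain classes and a hypothetical `combine` function with the same `type[T]` signature pattern as `create_combined_model`. At runtime the result has every attribute. What a static checker reports for `Both` depends on how it solves `T`, as described above.

```python
from typing import TypeVar

T = TypeVar("T")


def combine(name: str, cls1: type[T], cls2: type[T]) -> type[T]:
    # At runtime the new class inherits from both inputs.
    return type(name, (cls1, cls2), {})  # type: ignore


class Base:
    foo: str = "foo"


class Left(Base):
    bar: int = 1


class Right(Base):
    baz: float = 2.0


Both = combine("Both", Left, Right)
obj = Both()

# Runtime: all three attributes are present.
print(obj.foo, obj.bar, obj.baz)

# Static analysis: a checker that solves T as the join sees `Both` as
# type[Base], so it knows about `foo` but not about `bar` or `baz`.
```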