
How to correctly type a Pydantic model to handle string input for a list[float] with validation before initialization?


I'm using Pydantic to define a model where one of the fields, embedding, is expected to be a list[float]. However, I want to be able to pass a string to this field, and then have a validator transform this string into a list[float] before initialization.

Here's the code I'm working with:

from pydantic import BaseModel, field_validator
import uuid

class ChunkInsert(BaseModel):
    embedding: list[float]
    file_id: uuid.UUID

    @field_validator(
        "embedding",
        mode="before",
    )
    @classmethod
    def embed_files(cls, value: str) -> list[float]:
        return embed_text(value)[0]

chunk_in = ChunkInsert(
    embedding="a",
    file_id=uuid.UUID("987f5c8a-5577-4662-be1d-cb1ba016f6f5"),
)

The code works as expected, and embed_files processes the string and converts it into a list[float]. However, I'm getting the following type error in VS Code from Pylance:

Argument of type "Literal['a']" cannot be assigned to parameter "embedding" of type "list[float]" in function "__init__"
  "Literal['a']" is incompatible with "list[float]" Pylance(reportArgumentType)

It seems like Pylance is not recognizing that the embedding field should be processed by the embed_files validator before the type check.

So my question is: is there a way to configure Pydantic or Pylance so that this kind of pre-initialization validation doesn't trigger a type error?

Edit: since Pylance is a static type checker and I am changing the type dynamically before the model is created, is this even possible?


Solution

  • Here is one solution that should do what you want. If you would rather compute the embedding only when it is accessed, you could turn the property into a cached_property and compute it there. I'm assuming you might sometimes want to pass the embedding in directly, so the solution supports that as well:

    from typing import Self
    from pydantic import BaseModel, Field, computed_field, model_validator
    
    
    class ChunkInsert(BaseModel):
        text: str
        embedding_: list[float] | None = Field(default=None, exclude=True, repr=False)
    
        @computed_field
        @property
        def embedding(self) -> list[float]:
            assert self.embedding_
            return self.embedding_
    
        @model_validator(mode="after")
        def embed_files(self) -> Self:
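            if not self.embedding_:
                # Placeholder value: compute the real embedding from self.text
                # here, e.g. embed_text(self.text)[0] as in the question.
                self.embedding_ = [1.0]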
            return self
    
    
    # all of these make the typechecker happy
    
    print(ChunkInsert(text="foobar"))
    print(ChunkInsert(text="moodbar", embedding_=[1.0]))
    print(ChunkInsert(text="foobar").model_dump())
    print(ChunkInsert(text="moobar", embedding_=[1.0]).model_dump())
    print(ChunkInsert(text="moodbar", embedding_=[1.0]).embedding[0])
    
    # output:
    # text='foobar' embedding=[1.0]
    # text='moodbar' embedding=[1.0]
    # {'text': 'foobar', 'embedding': [1.0]}
    # {'text': 'moobar', 'embedding': [1.0]}
    # 1.0
    

    For anyone who wants a less verbose, if slightly less correct, solution, the following would do the same:

    from typing import Self
    from pydantic import BaseModel, model_validator
    
    
    class ChunkInsert(BaseModel):
        text: str
        embedding: list[float] = []
    
        @model_validator(mode="after")
        def embed_files(self) -> Self:
            if not self.embedding:
                self.embedding = [1.0]
            return self
    
    # same happiness, same output (pass embedding=[1.0] rather than embedding_=[1.0])
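
The cached_property idea mentioned at the top of this answer could look roughly like the sketch below. It assumes Pydantic v2, and embed_text here is a hypothetical stand-in for the helper from the question (the real one is not shown there), with a CALLS list added only to make the laziness visible. The embedding is computed from text on first access and then reused.

```python
from functools import cached_property

from pydantic import BaseModel, computed_field

# Records each call so the lazy, compute-once behaviour can be observed.
CALLS: list[str] = []


def embed_text(text: str) -> list[list[float]]:
    """Hypothetical stand-in for the question's embed_text helper."""
    CALLS.append(text)
    return [[float(len(text))]]


class ChunkInsert(BaseModel):
    text: str

    @computed_field
    @cached_property
    def embedding(self) -> list[float]:
        # Runs on first access only; the result is cached on the instance.
        return embed_text(self.text)[0]


chunk = ChunkInsert(text="foobar")  # nothing is embedded yet
print(chunk.embedding)              # first access triggers embed_text
print(chunk.model_dump())           # reuses the cached value
```

The trade-off is that a caller can no longer pass in a precomputed embedding, since embedding is not an init parameter in this variant. If that is needed, the embedding_ field approach above is the better fit.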