Checking for units with Pandera


I recently started using Pandera; what an excellent Python package!

Does anyone know whether it is possible to include column metadata in the SchemaModel of a dataframe? For instance, the unit of a column (seconds, kilometers, etc.).

Consider the following situation. I have two pandas DataFrames (say df1 and df2), each with a column distance. Now suppose I merge these two dataframes on some keys into a new dataframe called df_merged, and then take the sum of both distance columns. It would be great to validate that both distance units are equal (e.g., both km, or both cm) when validating the resulting dataframe.

I guess this would mean that the input schemas of df1 and df2 include some kind of metadata for the distance columns, and that Pandera checks whether the units are compatible.

Is this possible with Pandera, or do I need to implement this differently?


Solution

  • You didn't give example data, so I'll assume that the metadata is just another column in the schema. If you only want to check whether a dataframe is valid, that's straightforward with the existing Check interface:

    import pandera as pa
    from pandera.typing import Series
    from pandera import extensions
    import pandas as pd
    
    
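    @extensions.register_check_method(supported_types=pd.Series)
    def uniform(series: pd.Series):
        # Pass only if the column holds exactly one distinct value. Returning a
        # boolean Series reports every row as a failure case when it does not.
        is_uniform = len(series.unique()) == 1
        return pd.Series(is_uniform, index=series.index)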
    
    
    class Schema(pa.SchemaModel):
        distance: Series[float]
        unit: Series[str] = pa.Field(uniform=())
    

    Example:

    >>> Schema.validate(pd.DataFrame({
    ...     "distance": [1212., 3431., 4.],
    ...     "unit": ["m", "m", "km"],
    ... }))
    Traceback (most recent call last):
        [...]
        raise errors.SchemaError(
    pandera.errors.SchemaError: <Schema Column(name=unit, type=DataType(str))> failed element-wise validator 0:
    <Check uniform>
    failure cases:
       index failure_case
    0      0            m
    1      1            m
    2      2           km
    

    If you also want pandera to handle the ensuing transformation (e.g. "find the most frequent unit and convert all other rows to it"), you're out of luck. It's been proposed multiple times and has a good chance of being implemented, but it's not there yet.
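
Until then, the conversion can be done in plain pandas before validating. Here is a minimal sketch, assuming the same `distance`/`unit` layout as above and a hand-written table of conversion factors. `TO_METRES` and `normalize_units` are names made up for this illustration; they are not part of pandera.

```python
import pandas as pd

# Assumed conversion factors to metres (hypothetical table; extend as needed).
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}


def normalize_units(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every row to the most frequent unit in the `unit` column."""
    # mode() returns the most frequent value(s) in sorted order; take the first.
    target = df["unit"].mode().iloc[0]

    # Look up each row's factor; unknown units become NaN.
    factors = df["unit"].map(TO_METRES)
    if factors.isna().any():
        unknown = sorted(df.loc[factors.isna(), "unit"].unique())
        raise ValueError(f"unknown units: {unknown}")

    out = df.copy()
    # Go through metres, then into the target unit.
    out["distance"] = df["distance"] * factors / TO_METRES[target]
    out["unit"] = target
    return out


df = pd.DataFrame({
    "distance": [1212.0, 3431.0, 4.0],
    "unit": ["m", "m", "km"],
})
normalized = normalize_units(df)
print(normalized)
```

After this step every row carries the same unit, which is the condition the `uniform` check tests for. Raising on unknown units rather than silently dropping them is a design choice for the sketch; adjust it to taste.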