Search code examples
pythonpandasdataframepandera

Pandera checks for multiple columns


I am writing a Python script to validate columns in Pandas DataFrame using Pandera library. My issue is that I am OK validating one column at a time, but I was unable to find out how I can validate values in one column taking into consideration values from another column in the dataframe.

Specifically, I want to check if values exist in a column, and if they do I also want to check if values in another column for that row also contain certain text.

import numpy as np
import pandas as pd
import pandera as pa

dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
                          'column_B': ['1000', np.NaN, '2000', np.NaN]
                          })

schema = pa.DataFrameSchema(
    columns={
        'column_A': pa.Column(pa.String),
        'column_B': pa.Column(pa.String, nullable=True,
                              checks=pa.Check(
                                  lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna())))
    }
)

schema.validate(dataframe)

This was my attempt at solving the issue but Pandera is not aware of the fact that I want to examine multiple series during validation of column_B.


Solution

  • If you want to check (look for # define a wide DataFrame-level check in the documentation) at the dataframe level, you need to do your test outside of a column:

    import numpy as np
    import pandas as pd
    import pandera as pa
    
    dataframe = pd.DataFrame({'column_A': ['ABC company', 'BBB company', 'ABC company', 'CCC company'],
                              'column_B': ['1000', np.NaN, '2000', np.NaN]
                              })
    
    # define your dataframe level test
    check_AB = pa.Check(
        lambda df: (df['column_A'].str.contains('ABC')) & (~df['column_B'].isna()),
        name='check_AB'
    )
    
    schema = pa.DataFrameSchema(
        columns={
            'column_A': pa.Column(pa.String),
            'column_B': pa.Column(pa.String, nullable=True)
        },
        checks=check_AB  # <- dataframe wide check
    )
    

    Output:

    >>> schema.validate(dataframe)
    ...
    <Check check_AB>
    failure cases:
         column  index failure_case
    0  column_A      1  BBB company
    1  column_A      3  CCC company