
How to validate values within column of arrays with Great Expectations?


This is my first time working with GX, so this may be a simple question.

I have this pandas DataFrame, with one column of integer IDs and another of arrays (lists of strings):

import pandas as pd

df = pd.DataFrame(
    data={
        "request_id": [1, 2, 3, 4, 5],
        "failure_reasons": [
            [],
            [
                "reason_5"
            ],
            [
                "reason_2",
                "reason_3"
            ],
            [
                "reason_1",
                "reason_2",
                "reason_3",
                "reason_4",
                "reason_5"
            ],
            [],
        ]
    }
)

My goal here is to check whether the distinct values in failure_reasons belong to an expected set. After searching the Expectations Gallery, I tried the code below:

import great_expectations as gx

# Write the DataFrame to Parquet so GX can read it back as a data asset.
df.to_parquet("../gx_datasets/data.parquet")

context = gx.get_context()

validator = context.sources.pandas_default.read_parquet("../gx_datasets/data.parquet")

validator.expect_column_distinct_values_to_be_in_set(
    column="failure_reasons",
    value_set=[
        "reason_1",
        "reason_2",
        "reason_3",
        "reason_4",
        "reason_5"
    ]
)

It gave me this error:

MetricResolutionError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I'm not sure how to deal with it, since it's my first time with GX. It seems clear that GX is checking the column row by row, treating each array as a single value, but I can't figure out how to make it check the values themselves. Do I need to transform the DataFrame first every time? Or do I need to take another approach?

I couldn't find anything similar to my question here or anywhere else.
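
For context, the message in the error is NumPy's own wording. The sketch below reproduces it outside GX, assuming the list column comes back from Parquet as NumPy arrays (the `cell` variable is a hypothetical stand-in for one row's value). It does not show where inside GX the check happens, only why an array-valued cell can trigger this message.

```python
import numpy as np

# Hypothetical stand-in for one cell of "failure_reasons" after the Parquet
# round trip (assumed here to be a NumPy array rather than a Python list).
cell = np.array(["reason_2", "reason_3"])

# Comparing an array with a scalar is element-wise: the result is an array of booleans.
comparison = cell == "reason_2"

try:
    # Using that boolean array where a single True/False is required raises ValueError.
    bool(comparison)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Any check that needs one yes/no answer per row will hit this when a row holds a multi-element array.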


Solution

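  • I think you're exactly right about what's happening. The easiest way to get the check you want is to transform your data first, so that each row holds a single value:

    # One row per list element; rows with an empty list become NaN.
    df_validate = df.explode("failure_reasons")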
    context = gx.get_context()
    validator = context.sources.pandas_default.read_dataframe(df_validate)
    validator.expect_column_distinct_values_to_be_in_set(
        column="failure_reasons",
        value_set=[
            "reason_1",
            "reason_2",
            "reason_3",
            "reason_4",
            "reason_5"
        ]
    )
    

    You could also build a custom expectation.

    That said -- I'm from the GX team, and it makes sense to me to support this case directly, so I'll look at adding an explicit expectation for that.
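
To see what the explode step does before handing the data to GX, here is a minimal pandas-only sketch on a shortened copy of the question's data. The names `exploded`, `observed`, `allowed`, and `unexpected` are illustrative, and the set difference is only a rough stand-in for what the expectation checks, not GX's implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "request_id": [1, 2, 3],
        "failure_reasons": [[], ["reason_5"], ["reason_2", "reason_3"]],
    }
)

# explode() gives each list element its own row; an empty list becomes one NaN row.
exploded = df.explode("failure_reasons")

# Drop the NaN rows that came from empty lists, then collect the distinct reasons.
observed = set(exploded["failure_reasons"].dropna().unique())

# Expected set, copied from the question.
allowed = {"reason_1", "reason_2", "reason_3", "reason_4", "reason_5"}

# Any value left here would be a reason outside the expected set.
unexpected = observed - allowed

print(sorted(observed))
print(unexpected)
```

Note that the empty lists turn into NaN rows. If you are unsure how your GX version treats missing values for this expectation, calling `dropna(subset=["failure_reasons"])` on the exploded frame before validating is a simple way to keep them out of the check.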