Regarding Great Expectations
I want to create a custom expectation that checks whether a given id_product key in a DataFrame is associated with more than one unique id_client value.
After setting up my Great Expectations project, I'm having trouble figuring out how to define and implement a custom expectation for this specific validation.
Here is a data sample:
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'id_product': [1, 1, 2, 2, 2, 3, 3],
    'id_client': [101, 102, 201, 202, 203, 301, 301]
})
This is the validation I can do in pandas but not in Great Expectations:
def count_unique_rows(df, id_column, other_column):
    # Collapse to one row per distinct (id_column, other_column) pair
    unique_rows = df.groupby([id_column, other_column]).size().reset_index()
    # Count how many distinct pairs each id_column value has
    count = unique_rows.groupby(id_column).size().reset_index(name='count')
    return count

assert any(count_unique_rows(df, 'id_product', 'id_client')['count'] > 1)
Basically, I want to check for data inconsistencies by asserting a condition like the one above.
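For reference, the same per-product count can be sketched more compactly with `nunique`. This is only an illustrative rewrite of the check described above; the helper name `count_distinct_clients` and the variable `has_multiple` are made up for this example.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'id_product': [1, 1, 2, 2, 2, 3, 3],
    'id_client': [101, 102, 201, 202, 203, 301, 301],
})

def count_distinct_clients(df, id_column, other_column):
    """Return one row per id_column value with its number of distinct other_column values."""
    # count_distinct_clients is a hypothetical helper name, not part of any library
    return df.groupby(id_column)[other_column].nunique().reset_index(name='count')

counts = count_distinct_clients(df, 'id_product', 'id_client')

# True if at least one product is linked to more than one distinct client
has_multiple = bool((counts['count'] > 1).any())
print(counts)
print(has_multiple)
```

On the sample data, products 1 and 2 have several distinct clients while product 3 has only one, so the check passes.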
You could add a custom expectation like this one:
import pandas as pd
import great_expectations as gx
# Legacy Dataset API
from great_expectations.dataset import (
    PandasDataset,
    MetaPandasDataset,
)

class MyCustomPandasDataset(PandasDataset):
    _data_asset_type = "MyCustomPandasDataset"

    @MetaPandasDataset.column_map_expectation
    def expect_unique_pair(self, column):
        # Flag each (id_product, id_client) pair that occurs exactly once
        is_pair_unique_df = (self.groupby(['id_product', 'id_client']).size().to_frame('size') == 1).reset_index()
        # Broadcast the per-pair flag back onto every row
        return pd.merge(self, is_pair_unique_df, on=['id_product', 'id_client'], how="left")["size"]

my_validated_df = gx.from_pandas(df, dataset_class=MyCustomPandasDataset)
print(my_validated_df.expect_unique_pair('id_client'))
The expect_unique_pair method checks the given MyCustomPandasDataset for uniqueness of the key [id_product, id_client]. It returns a boolean Series indicating, for each row, whether its pair is unique.
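To see what such a row-level boolean Series looks like without involving Great Expectations, here is a minimal pandas-only sketch. It assumes that "unique" means the pair occurs exactly once in the DataFrame; `pair_is_unique` is a hypothetical helper name, and `duplicated(keep=False)` is swapped in for the groupby and merge approach.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'id_product': [1, 1, 2, 2, 2, 3, 3],
    'id_client': [101, 102, 201, 202, 203, 301, 301],
})

def pair_is_unique(df, key_columns):
    """Boolean Series aligned with df: True where the row's key combination occurs exactly once."""
    # keep=False marks every member of a duplicated group, so negating it
    # leaves True only for combinations that appear a single time
    return ~df.duplicated(subset=key_columns, keep=False)

mask = pair_is_unique(df, ['id_product', 'id_client'])
print(df.assign(pair_is_unique=mask))
```

On the sample data, only the two rows with the pair (3, 301) are flagged as not unique. Because the mask keeps the original index, it lines up row by row with the DataFrame.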