Tags: python, pandas, dataframe, great-expectations

How to pass a CustomDataAsset to a DataContext to run custom expectations on a batch?


I have a CustomPandasDataset that defines a custom expectation:

from great_expectations.data_asset import DataAsset
from great_expectations.dataset import PandasDataset
from datetime import date, datetime, timedelta

class CustomPandasDataset(PandasDataset):

    _data_asset_type = "CustomPandasDataset"

    @DataAsset.expectation(["column", "datetime_match", "datetime_diff"])
    def expect_column_max_value_to_match_datetime(self, column: str, datetime_match: datetime = None, datetime_diff: tuple = None) -> dict:
        """
        Check if data is constantly updated by matching the max datetime column to a
        datetime value or to a datetime difference.
        """
        max_datetime = self[column].max()

        # Default to today's date when no reference value is given
        if datetime_match is None:
            datetime_match = date.today()

        if datetime_diff:
            # datetime_diff holds positional timedelta arguments, e.g. (4,) for 4 days
            success = (datetime_match - timedelta(*datetime_diff)) <= max_datetime <= datetime_match
        else:
            success = (max_datetime == datetime_match)

        result = {
            "data_max_value": max_datetime,
            "expected_max_value": str(datetime_match),
            "expected_datetime_diff": datetime_diff
        }

        return {
            "success": success,
            "result": result
        }
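
To make the comparison inside the expectation easier to follow, here is a minimal standalone sketch of the same window check, using only the standard library. The helper name max_date_in_window and the sample dates are hypothetical and not part of Great Expectations; the sketch assumes the column holds plain date values.

```python
from datetime import date, timedelta


def max_date_in_window(max_date, datetime_match=None, datetime_diff=None):
    """Hypothetical helper mirroring the comparison used in the expectation."""
    if datetime_match is None:
        datetime_match = date.today()
    if datetime_diff:
        # The tuple is unpacked into timedelta, so (4,) means timedelta(days=4)
        lower_bound = datetime_match - timedelta(*datetime_diff)
        return lower_bound <= max_date <= datetime_match
    # Without a difference, the max value must equal the reference exactly
    return max_date == datetime_match


reference = date(2021, 3, 10)  # assumed reference date, for illustration only
print(max_date_in_window(date(2021, 3, 8), reference, (4,)))  # True
print(max_date_in_window(date(2021, 3, 1), reference, (4,)))  # False
print(max_date_in_window(date(2021, 3, 10), reference))       # True
```

With datetime_diff=(4,), the check passes when the newest value is at most four days older than the reference date and not later than it.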

I want to run the expectation expect_column_max_value_to_match_datetime on a given pandas DataFrame:

import great_expectations as ge

# `context` is an existing DataContext
expectation_suite_name = "df-raw-expectations"

suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

df_ge = ge.from_pandas(df, dataset_class=CustomPandasDataset)

batch_kwargs = {'dataset': df_ge, 'datasource': 'df_raw_datasource'}

# Get batch of data
batch = context.get_batch(batch_kwargs, suite)

The batch comes from a DataContext. When I run the custom expectation on this batch:

datetime_diff = (4,)  # one-element tuple, unpacked into timedelta(days=4)
batch.expect_column_max_value_to_match_datetime(column='DATE', datetime_diff=datetime_diff)

I get the following error:

AttributeError: 'PandasDataset' object has no attribute 'expect_column_max_value_to_match_datetime'

Following the docs, I specified dataset_class=CustomPandasDataset when constructing the Great Expectations dataset. Running the expectation directly on df_ge works, but running it on the batch of data does not.


Solution

  • According to the docs:

    To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource configuration or batch_kwargs for a specific batch.

    So pass CustomPandasDataset through the data_asset_type parameter of get_batch():

    # Get batch of data
    batch = context.get_batch(batch_kwargs, suite, data_asset_type=CustomPandasDataset)
    

    or define it in the DataContext configuration:

    from great_expectations.data_context.types.base import DataContextConfig
    from great_expectations.data_context import BaseDataContext
    
    data_context_config = DataContextConfig(
        ...
        datasources={
            "sales_raw_datasource": {
                "data_asset_type": {
                    "class_name": "CustomPandasDataset",
                    "module_name": "custom_dataset",
                },
                "class_name": "PandasDatasource",
                "module_name": "great_expectations.datasource",
            }
        },
        ...
    )
    context = BaseDataContext(project_config=data_context_config)
    

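    where CustomPandasDataset is importable from the module custom_dataset (the script custom_dataset.py). Note that the datasource name used here (sales_raw_datasource) differs from the one in the question's batch_kwargs (df_raw_datasource); presumably the two should match.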
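
A class_name / module_name pair like the one in the configuration is typically resolved by importing the module and looking the class up by name, which is why custom_dataset.py has to be importable from wherever the context is created. The sketch below illustrates that general pattern with the standard library only. It is an assumption about the mechanism, not Great Expectations' actual loading code, and load_class is a hypothetical helper; a standard-library class stands in for CustomPandasDataset so the example runs on its own.

```python
import importlib


def load_class(module_name, class_name):
    """Hypothetical helper: import a module by name and return one of its classes."""
    module = importlib.import_module(module_name)  # raises ImportError if not importable
    return getattr(module, class_name)             # raises AttributeError if missing


# Stand-in for {"class_name": "CustomPandasDataset", "module_name": "custom_dataset"}
config = {"class_name": "OrderedDict", "module_name": "collections"}

cls = load_class(config["module_name"], config["class_name"])
print(cls.__name__)  # OrderedDict
```

If the module is not on the Python path, the import step fails, so a quick check such as importing custom_dataset from the same working directory can help rule that out.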