Tags: python, pandas, great-expectations

Use Great Expectations to validate pandas DataFrame with existing suite JSON


I'm using the Great Expectations Python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a Great Expectations suite based on a .csv version of the data (call this file ge_suite.json).

GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.

I've tried following the answer to this SO question, with code that looks like this:

import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext

context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")

The datasources section of my great_expectations.yml file looks like this:

datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector

When I run the batch = context.get_batch(...) line in Python, I get the following error:

File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
  return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
  batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'

I'm assuming I need to add something to the datasource definition in the great_expectations.yml file to fix this, or it could be a versioning issue; I'm not sure. I looked around the online documentation for a while and didn't find an answer. How do I achieve the "GOAL" (defined above) and get past this error?


Solution

  • The error happens because batch_kwargs belongs to Great Expectations' older (v2) API: passing it routes get_batch through _get_batch_v2, while the Datasource/RuntimeDataConnector configuration in your great_expectations.yml is the newer (v3) style, whose Datasource class has no v2-style get_batch method. To validate an in-memory pandas DataFrame with the v3 API, you can reference the following two pages:

    https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/

    https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe/

    To give a concrete example in code though, you can do something like this:

    import great_expectations as ge
    import os
    import pandas as pd
    from great_expectations.core.batch import RuntimeBatchRequest
    
    context = ge.get_context()
    df = pd.read_pickle('/path/to/my/df.pkl')
    
    suite_name = 'ge_suite'
    data_asset_name = 'your_data_asset_name'
    batch_id = 'your_batch_id'
    
    # A RuntimeBatchRequest lets you hand the in-memory DataFrame to GE
    # directly via runtime_parameters, using the RuntimeDataConnector already
    # defined in great_expectations.yml.
    batch_request = RuntimeBatchRequest(
        datasource_name="my_datasource_name",
        data_connector_name="default_runtime_data_connector_name",
        data_asset_name=data_asset_name,
        runtime_parameters={"batch_data": df},
        batch_identifiers={"default_identifier_name": batch_id},
    )
    
    # context.run_checkpoint looks for a checkpoint file on disk, so create one.
    checkpoint_name = 'your_checkpoint_name'
    checkpoint_dir = os.path.abspath('./great_expectations/checkpoints')
    os.makedirs(checkpoint_dir, exist_ok=True)  # make sure the directory exists
    checkpoint_path = os.path.join(checkpoint_dir, f'{checkpoint_name}.yml')
    checkpoint_yml = f'''
    name: {checkpoint_name}
    config_version: 1
    class_name: SimpleCheckpoint
    expectation_suite_name: {suite_name}
    '''
    with open(checkpoint_path, 'w') as f:
        f.write(checkpoint_yml)
    
    result = context.run_checkpoint(
        checkpoint_name=checkpoint_name,
        validations=[{"batch_request": batch_request, 'expectation_suite_name': suite_name}, ],
    )
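    Once the checkpoint has run, result.success tells you whether every expectation passed, and per-expectation details live in the validation results (e.g. via result.list_validation_results() or result.to_json_dict() in this GE version). As a sketch of pulling out just the failures, here is a small helper that works on the JSON/dict form of a single validation result; note the sample dict below is a hand-made stand-in shaped like GE's validation-result schema, not real checkpoint output:

```python
def failed_expectations(validation_result: dict) -> list:
    """List the expectation types that failed in one GE validation result.

    Expects the JSON/dict form of a validation result, which has a top-level
    "results" list whose entries carry "success" and "expectation_config".
    """
    return [
        r["expectation_config"]["expectation_type"]
        for r in validation_result.get("results", [])
        if not r.get("success", True)
    ]


# Hand-made stand-in shaped like a GE validation result (not real output):
sample = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist"}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"}},
    ],
}
print(failed_expectations(sample))  # ['expect_column_values_to_not_be_null']
```

    This keeps failure reporting decoupled from GE objects, which is handy if you log or alert on validation results elsewhere in a pipeline.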