I'm using the Great Expectations Python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a Great Expectations suite based on a .csv file version of the data (call this file ge_suite.json).
GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.
I've tried following the answer to this SO question, with code that looks like this:
import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext
context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")
My datasources section of my great_expectations.yml file looks like this:
datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
When I run the batch = context.get_batch(... command in Python, I get the following error:
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'
I'm assuming that I need to add something to the datasource definition in the great_expectations.yml file to fix this. Or could it be a versioning issue? I'm not sure. I looked around in the online documentation for a while and didn't find an answer. How do I achieve the "GOAL" (defined above) and get past this error?
If you want to validate an in-memory pandas DataFrame, you can reference the following documentation page for information on how to do that:
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/
To give a concrete example in code though, you can do something like this:
import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
df = pd.read_pickle('/path/to/my/df.pkl')

suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": batch_id},
)
# The context.run_checkpoint method looks for a checkpoint file on disk, so create one...
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, "expectation_suite_name": suite_name}],
)