Probably a very simple question, but I could not figure it out from the great_expectations documentation. I would like to run some tests on a pandas dataframe that is stored locally as a pickle file (.pkl).
When I ran great_expectations add-datasource, it ignored the .pkl files and only created assets for .csv files.
Reading CSV files with pandas is slow, so it would be great if GE could support other formats like pickle and HDF.
How can I load .pkl or .hdf files as GE assets?
I'm using v0.8.7 :)
For pandas (and spark), there is a good general-purpose approach for having full control over how the data is read, which is to specify an already-available dataframe via your BatchKwargs.
So, in your case, you could do the following:
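    import pandas as pd

    # Read the pickle yourself, then hand the resulting DataFrame to GE.
    # `context` is assumed to be an existing DataContext.
    my_dataset = pd.read_pickle(filename)
    batch_kwargs = {"dataset": my_dataset}
    batch = context.get_batch("my_datasource/in_memory_generator/my_dataset", "warning", batch_kwargs)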
Note: this is for the 0.8.x series API, and assumes a data context configuration like the following:
    datasources:
      my_datasource:
        class_name: PandasDatasource
        ...
        generators:
          in_memory_generator:
            class_name: InMemoryGenerator
PS: this use case is the primary reason the InMemoryGenerator exists.
EDIT
In Great Expectations >= 0.9.0, the API for get_batch has been simplified, so you would no longer need a generator at all in this case, and the datasource name is specified in the batch kwargs. The analogous code snippet looks like this:
    import pandas as pd
    from great_expectations.data_context import DataContext

    context = DataContext()
    my_dataset = pd.read_pickle(filename)
    batch_kwargs = {"datasource": "my_datasource", "dataset": my_dataset}
    batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="warning")
(and no generator is needed)
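Since the question mentions both .pkl and .hdf files, here is a minimal sketch of one way to pick the pandas reader from the file extension before passing the dataframe to GE. The names READERS, reader_name_for and load_dataframe are hypothetical helpers, not part of great_expectations or pandas; only pd.read_csv, pd.read_pickle and pd.read_hdf are real pandas functions.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of a pandas reader function.
READERS = {
    ".csv": "read_csv",
    ".pkl": "read_pickle",
    ".pickle": "read_pickle",
    ".hdf": "read_hdf",
    ".h5": "read_hdf",
}


def reader_name_for(path):
    """Return the name of the pandas reader registered for the suffix of `path`."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r} files") from None


def load_dataframe(path, **reader_kwargs):
    """Read `path` into a DataFrame with the reader chosen by its suffix."""
    import pandas as pd  # imported lazily so the suffix lookup works without pandas

    reader = getattr(pd, reader_name_for(path))
    return reader(path, **reader_kwargs)
```

The result of load_dataframe(filename) can then be used as the "dataset" entry in batch_kwargs in either snippet above. Note that pd.read_hdf needs the optional PyTables dependency, and may need a key argument (for example load_dataframe(filename, key="df")) when the file holds more than one object.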