python pandas great-expectations

Use a pickled pandas dataframe as a data asset in great_expectations


Probably a very simple question, but I could not figure it out from the great_expectations documentation. I would like to run some tests on a pandas dataframe that is stored locally as a pickle file (.pkl).

When I ran great_expectations add-datasource, it ignored the .pkl files and only created assets for the .csv files. Reading CSV files with pandas is slow, so it would be great if GE could support other formats such as pickle and HDF.

How can I load .pkl or .hdf files as GE data assets?

I'm using v0.8.7 :)


Solution

  • For pandas (and Spark), there is a good general-purpose approach that gives you full control over how the data is read: pass an already-loaded dataframe via your BatchKwargs.

    So, in your case, you could do the following:

    import pandas as pd

    # filename is the path to the .pkl file; context is an existing DataContext
    my_dataset = pd.read_pickle(filename)
    batch_kwargs = {"dataset": my_dataset}
    batch = context.get_batch("my_datasource/in_memory_generator/my_dataset", "warning", batch_kwargs)
    

    Note: this is for the 0.8.x series API, and assumes a data context configuration like the following:

    datasources:
      my_datasource:
        class_name: PandasDatasource
        ...
        generators:
          in_memory_generator:
            class_name: InMemoryGenerator
    

    PS: This use case is the primary reason the InMemoryGenerator exists.

    EDIT

    In Great Expectations >= 0.9.0, the API for get_batch has been simplified: you no longer need a generator at all in this case, and the datasource name is specified in the batch kwargs. The analogous code snippet looks like this:

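    import pandas as pd
    from great_expectations.data_context import DataContext

    context = DataContext()
    my_dataset = pd.read_pickle(filename)  # filename: path to the .pkl file
    batch_kwargs = {"datasource": "my_datasource", "dataset": my_dataset}
    batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="warning")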
    

    (and no generator is needed)
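
The question also asks about HDF files. Since the approach above only needs an already-loaded dataframe, the same pattern should work for any format pandas can read. Here is a minimal sketch of that idea. The names `READERS`, `load_dataframe`, and `make_batch_kwargs` are hypothetical helpers made up for illustration, not part of Great Expectations, and the batch kwargs layout assumes the >= 0.9.0 form shown in the answer. The demo at the end only exercises the pickle path and does not call Great Expectations.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader; extend as needed.
# pd.read_hdf needs the optional PyTables package, and reading without a key
# only works when the file holds a single pandas object.
READERS = {
    ".pkl": pd.read_pickle,
    ".hdf": pd.read_hdf,
    ".h5": pd.read_hdf,
    ".csv": pd.read_csv,
}


def load_dataframe(path):
    """Read a DataFrame using the reader that matches the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return READERS[ext](path)


def make_batch_kwargs(path, datasource_name="my_datasource"):
    """Wrap an already-loaded DataFrame in 0.9-style batch kwargs."""
    return {"datasource": datasource_name, "dataset": load_dataframe(path)}


# Round-trip demo with a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    pd.DataFrame({"a": [1, 2, 3]}).to_pickle(demo_path)
    demo_kwargs = make_batch_kwargs(demo_path)

print(demo_kwargs["dataset"].shape)
```

The resulting dictionary can then be passed as `batch_kwargs` to `context.get_batch(...)` in the same way as in the >= 0.9.0 snippet, with the datasource name adjusted to match your own configuration.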