Search code examples
pythonpandaspostgresqlgreat-expectations

Using Python Great Expectations to remove invalid data


I just started with Great Expectations library and I want to know if it is possible to use it to remove invalidated data from Pandas DataFrame. And how I can do that if is possible ? Also I want to insert invalid data to PostgreSQL database.

I didn't find anything about this in the documentation and on searching the Web.

Later Edit : To clarify: I need that in the case great expectation for example find 5 rows in a DataFrame that are invalid (for example df.expect_column_values_to_not_be_null('age') has 5 rows with null) to remove them from original DataFrame and insert them in a PostgreSQL errors table


Solution

  • Great Expectations is a powerful tool to validate data.
    Like all powerful tools, it's not that straightforward.

    You can start from here:

    import great_expectations as ge
    import numpy as np
    import pandas as pd
        
    # get some random numbers and create a pandas df
    df_raw = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
    
    # initialize a "great_expectations" df 
    df = ge.from_pandas(df_raw)
    
    # search for invalidate data on column 'A'. 
    # In this case, i'm looking for any null value from column 'A'.
    df.expect_column_values_to_not_be_null('A')
    

    Results:

    {
      "exception_info": null,
      "expectation_config": {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "column": "A",
          "result_format": "BASIC"
        },
        "meta": {}
      },
      "meta": {},
      "success": true,
      "result": {
        "element_count": 100,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": []
      }
    }
    

    Look at the response : good news !!!
    There aren't null values in my df. "unexpected_count" is equal to 0

    API Reference : https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/index.html


    EDIT: If you need simply to find some invalid values and split your df into:

    1. Clean Dataframe
    2. Dirty Dataframe

    maybe you dont need "great_expectations". you can use a function like this:

    import pandas as pd
    
    my_df = pd.DataFrame({'A': [1,2,1,2,3,0,1,1,5,2]})
    
    def check_data_quality(dataframe):
        df = dataframe
        clean_df = df[df['A'].isin([1, 2])]
        dirty_df = df[df["A"].isin([1, 2]) == False]
        return {'clean': clean_df, 
                'dirty': dirty_df}
    
    my_df_clean = check_data_quality(my_df)['clean']
    my_df_dirty = check_data_quality(my_df)['dirty']