
Great Expectations expect column to contain only integers fails for all rows when only one is bad


I want to use the Great Expectations package to validate that a column in a .csv file contains only integers.

The file I am using has only integers in the age column except for one row which has a '`' character instead. This is what I want the expectation to catch. I have also checked the .csv file in a text editor and can confirm that the ages in the age column are not enclosed in quotes.

However, the expectation fails on 100% of the data. I think this is because pandas is reading the column in as the object type (so a string) because of the one incorrect row. I can't preprocess this with something like .astype(int) because it will fail on that row, and wrapping .astype(int) in a try block would completely defeat the purpose of using Great Expectations in the first place.
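To illustrate the dtype issue, here is a minimal sketch (assuming pandas is installed and bad.csv is as shown further down): the single bad row forces pandas to read every value in the column as a string, and coercing the column raises on that row.

import pandas as pd

# One non-numeric value makes pandas fall back to the object dtype,
# so every value in the column is read as a string.
df = pd.read_csv("./bad.csv")
print(df["age"].dtype)     # object
print(df["age"].tolist())  # ['34', '`', '54']

# Coercing the whole column fails outright on the bad row,
# so it can't be used as a simple preprocessing step.
try:
    df["age"].astype(int)
except ValueError as exc:
    print(exc)             # invalid literal for int() with base 10: '`'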

Here is a minimal working example:

good.csv:

age,name
34,Fred
22,Bob
54,Mary

bad.csv:

age,name
34,Fred
`,Bob
54,Mary

Code:

import great_expectations as ge

df = ge.read_csv("./good.csv");
my_df.expect_column_values_to_be_of_type('age','int')

df = ge.read_csv("./bad.csv");
my_df.expect_column_values_to_be_of_type('age','int')

The first case returns

{'success': True,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}

So all the ages are ints and the expectation succeeds on every row. I expect the second case to fail, but only on the second row. However, it fails on all the rows:

{'success': False,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 3,
  'unexpected_percent': 1.0,
  'unexpected_percent_nonmissing': 1.0,
  'partial_unexpected_list': ['34', '`', '54']}}

So I guess I expect something like this:

{'success': False,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 1,
  'unexpected_percent': 0.33,
  'unexpected_percent_nonmissing': 0.33,
  'partial_unexpected_list': ['`']}}

Is there something I am doing wrong, or is the package just not capable of this?


Solution

  • As a workaround until the new expect_column_values_to_be_parseable_as_type is implemented, I can achieve the same result using a regex expectation instead:

    my_df = ge.read_csv("bad.csv")
    pattern = r'^\d+$'
    result = my_df.expect_column_values_to_match_regex('age',
                                                       pattern,
                                                       result_format={'result_format': 'COMPLETE'})
    
    result
    # In my case I'm only interested in where it went wrong
    # print(result['result']['unexpected_list'])
    # print(result['result']['unexpected_index_list'])
    

    which gives:

    {'success': False,
     'result': {'element_count': 3,
      'missing_count': 0,
      'missing_percent': 0.0,
      'unexpected_count': 1,
      'unexpected_percent': 0.3333333333333333,
      'unexpected_percent_nonmissing': 0.3333333333333333,
      'partial_unexpected_list': ['`'],
      'partial_unexpected_index_list': [1],
      'partial_unexpected_counts': [{'value': '`', 'count': 1}],
      'unexpected_list': ['`'],
      'unexpected_index_list': [1]}}
    

    Note that if you want to allow a leading + or -, you need to change the pattern to:

    pattern = r'^[+-]?\d+$'
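
    For example (a quick sketch against the same bad.csv as above), the signed pattern still flags only the backtick row:

    import great_expectations as ge

    my_df = ge.read_csv("bad.csv")
    signed_pattern = r'^[+-]?\d+$'   # also accepts values such as -5 or +42

    result = my_df.expect_column_values_to_match_regex('age',
                                                       signed_pattern,
                                                       result_format={'result_format': 'COMPLETE'})
    print(result['success'])                          # False
    print(result['result']['unexpected_index_list'])  # [1]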