I want to use the great expectations package to validate that a column in a .csv file only contains integers.
The file I am using has only integers in the age column except for one row which has a '`' character instead. This is what I want the expectation to catch. I have also checked the .csv file in a text editor and can confirm that the ages in the age column are not enclosed in quotes.
However, the expectation fails on 100% of the data. I think this is because pandas reads the column in as the object type (i.e. as strings) due to the one bad row. I can't preprocess this with something like .astype(int)
because it will fail on that row, and wrapping .astype(int)
in a try
block would completely defeat the purpose of using great expectations.
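You can confirm the object-dtype theory with plain pandas. A sketch using an in-memory CSV with the same contents as bad.csv:

```python
import io
import pandas as pd

# One non-numeric value forces pandas to read the whole column as object (strings)
bad = io.StringIO("age,name\n34,Fred\n`,Bob\n54,Mary\n")
df = pd.read_csv(bad)
print(df["age"].dtype)     # object
print(df["age"].tolist())  # ['34', '`', '54']
```

Every value in the column, including the valid ones, comes back as a string, which is why the type expectation flags all three rows.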
Here is a minimal working example:
good.csv:
age,name
34,Fred
22,Bob
54,Mary
bad.csv:
age,name
34,Fred
`,Bob
54,Mary
Code:
import great_expectations as ge
my_df = ge.read_csv("./good.csv")
my_df.expect_column_values_to_be_of_type('age', 'int')
my_df = ge.read_csv("./bad.csv")
my_df.expect_column_values_to_be_of_type('age', 'int')
The first case returns
{'success': True,
'result': {'element_count': 3,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []}}
So all the ages are ints and it succeeds on every row. I expect the second case to fail, but only on the second row. However it fails on all the rows:
{'success': False,
'result': {'element_count': 3,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 3,
'unexpected_percent': 1.0,
'unexpected_percent_nonmissing': 1.0,
'partial_unexpected_list': ['34', '`', '54']}}
So I guess I expect something like this:
{'success': False,
'result': {'element_count': 3,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 1,
'unexpected_percent': 0.33,
'unexpected_percent_nonmissing': 0.33,
'partial_unexpected_list': ['`']}}
Is there something I am doing wrong, or is the package just not capable of this?
As a workaround until the new expect_column_values_to_be_parseable_as_type
is implemented, I can achieve the same result using a regex expectation instead:
my_df = ge.read_csv("bad.csv")
pattern = r'^\d+$'
result = my_df.expect_column_values_to_match_regex('age',
pattern,
result_format={'result_format': 'COMPLETE'})
result
# In my case I'm only interested in where it went wrong:
# print(result['result']['unexpected_list'])
# print(result['result']['unexpected_index_list'])
which gives:
{'success': False,
'result': {'element_count': 3,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 1,
'unexpected_percent': 0.3333333333333333,
'unexpected_percent_nonmissing': 0.3333333333333333,
'partial_unexpected_list': ['`'],
'partial_unexpected_index_list': [1],
'partial_unexpected_counts': [{'value': '`', 'count': 1}],
'unexpected_list': ['`'],
'unexpected_index_list': [1]}}
Note if you want to allow a leading + or - you need to change the pattern to:
pattern = r'^[+-]?\d+$'
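If you only need to locate the offending rows and don't require a full expectation result, a plain-pandas sketch using pd.to_numeric with errors='coerce' works too (again using an in-memory copy of bad.csv for illustration):

```python
import io
import pandas as pd

csv = io.StringIO("age,name\n34,Fred\n`,Bob\n54,Mary\n")
df = pd.read_csv(csv)

# Coerce non-numeric values to NaN, then select the rows where coercion failed
coerced = pd.to_numeric(df["age"], errors="coerce")
bad_rows = df[coerced.isna()]
print(bad_rows.index.tolist())   # [1]
print(bad_rows["age"].tolist())  # ['`']
```

This doesn't give you the validation report that great expectations produces, but it is handy for quickly pinpointing which rows to fix.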