
Python column validation using pandas_schema


I am trying to validate my DataFrame columns using PandasSchema. I am stuck at validating some columns, such as:

1. ip_address - should contain an IP address in the format 1.1.1.1, or be null; any other value should raise an error.
2. initial_date - format yyyy-mm-dd h:m:s or mm-dd-yyyy h:m:s etc.
3. customertype - should be in ['type1', 'type2', 'type3']; anything else should raise an error.
4. customersatisfied - yes/no or blank.
5. customerid - should not be longer than 5 characters, e.g. cus01, cus02.
6. time - should be in h:m:s format; anything else should raise an exception.

import datetime

import numpy as np
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import CustomElementValidation

def check_string(value):
    try:
        str(value)
    except (TypeError, ValueError):
        return False
    return True

def check_datetime(value, date_format):
    try:
        datetime.datetime.strptime(value, date_format)
        return True
    except (TypeError, ValueError):
        return False

def check_int(num):
    try:
        int(num)
    except (TypeError, ValueError):
        return False
    return True

string_validation = [CustomElementValidation(lambda x: check_string(x) and len(str(x)) <= 5, 'field is not a string of at most 5 characters')]
int_validation = [CustomElementValidation(lambda i: check_int(i), 'is not an integer')]
contain_validation = [CustomElementValidation(lambda y: y in ['type1', 'type2', 'type3'], 'is not a valid customer type')]
date_time_validation = [CustomElementValidation(lambda dt: check_datetime(dt, '%m/%d/%Y %H:%M %p'), 'is not a date time')]
null_validation = [CustomElementValidation(lambda d: d is not np.nan, 'this field cannot be null')]

schema = Schema([
                 Column('CompanyID', string_validation + null_validation),
                 Column('initialdate', date_time_validation),
                 Column('customertype', contain_validation),
                 Column('ip', string_validation),
                 Column('customersatisfied', string_validation)])
errors = schema.validate(combined_df)
errors_index_rows = [e.row for e in errors]
pd.DataFrame({'col':errors}).to_csv('errors.csv')

Solution

  • I've just looked at the documentation for PandasSchema, and most if not all of what you are looking for is available out of the box. Take a look at the built-in validators in pandas_schema.validation:

    As a quick attempt to solve your issue, something along the lines of this should work:

    from pandas_schema import Column, Schema
    from pandas_schema.validation import (
        InListValidation,
        IsDtypeValidation,
        DateFormatValidation,
        MatchesPatternValidation,
    )
    
    schema = Schema([
        # Match a string of length between 1 and 5
        Column('CompanyID', [MatchesPatternValidation(r"^.{1,5}$")]),
    
        # Match a date-like string of ISO 8601 format (https://www.iso.org/iso-8601-date-and-time-format.html)
        Column('initialdate', [DateFormatValidation("%Y-%m-%d %H:%M:%S")], allow_empty=True),
        
        # Match only strings in the following list
        Column('customertype', [InListValidation(["type1", "type2", "type3"])]),
    
        # Match an IP address RegEx (https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html)
        Column('ip', [MatchesPatternValidation(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")]),
    
        # Match only strings in the following list    
        Column('customersatisfied', [InListValidation(["yes", "no"])], allow_empty=True)
    ])
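
    The question also mentions a time-of-day column; DateFormatValidation with a time-only format string (e.g. "%H:%M:%S") should cover that in the same way. Below is a minimal usage sketch putting it together: the 'time' column name and the sample rows are assumptions made purely for illustration (the second row is constructed so that every check fails). Schema.validate returns a list of ValidationWarning objects that record the offending row, column and value.

    import pandas as pd
    from pandas_schema import Column, Schema
    from pandas_schema.validation import (
        InListValidation,
        DateFormatValidation,
        MatchesPatternValidation,
    )

    schema = Schema([
        Column('CompanyID', [MatchesPatternValidation(r"^.{1,5}$")]),
        Column('initialdate', [DateFormatValidation("%Y-%m-%d %H:%M:%S")], allow_empty=True),
        Column('customertype', [InListValidation(["type1", "type2", "type3"])]),
        Column('ip', [MatchesPatternValidation(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")]),
        Column('customersatisfied', [InListValidation(["yes", "no"])], allow_empty=True),
        # Assumed column name for requirement 6: time in %H:%M:%S format.
        Column('time', [DateFormatValidation("%H:%M:%S")]),
    ])

    # Hypothetical sample data, purely for illustration; the second row violates every rule.
    combined_df = pd.DataFrame({
        'CompanyID':         ['cus01',               'customer_0001'],
        'initialdate':       ['2020-01-01 10:00:00', '01-01-2020'],
        'customertype':      ['type1',               'type9'],
        'ip':                ['1.1.1.1',             'not-an-ip'],
        'customersatisfied': ['yes',                 'maybe'],
        'time':              ['10:00:00',            '25:99'],
    })

    errors = schema.validate(combined_df)
    for error in errors:
        # Each ValidationWarning knows the row index, column name, offending value and message.
        print(error.row, error.column, error.value, error.message)

    # Or collect them into a CSV, as in the original attempt.
    pd.DataFrame({'col': [str(e) for e in errors]}).to_csv('errors.csv', index=False)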