Search code examples
pythonpandasdataframeboolean-indexing

Pandas dataframe data validation methods


Currently using a dataframe to store information on data we've collected. Prior to submitting the data, we need to validate the data based off a list of rules. Trying to set up these validations in python, and part of the problem is readability vs. performance.

For example: (Each item is a column in the data frame) Rule: Dir_Through_Lanes must exist where F_System = 1 and Facility_Type in (1;2;6) and RoadEventCollectionMethods Pav_Rep_Method = 2

Two methods of validating the data:

  1. Using an apply
def sjpm201a(self):
        def sjpm201_check(row):
            print(row)
            if row['F_SYSTEM'] == 1 and row['FACILITY_TYPE'] in [1,2,6] and row['PAVE_REP_METHOD'] == 2:
                if row['DIR_THROUGH_LANES'] == np.nan:
                    row['SJPM201'] = False
                    return row
            row['SJPM201a'] = True
            return row
        self.df = self.df.apply(sjpm201_check, axis=1)
  1. Using boolean indexing
    def sjpm201b(self):
        df = self.df
        self.df['SJPM201b'] = ((df['DIR_THROUGH_LANES'].notna()) | ((df['F_SYSTEM'] != 1) | (~df['FACILITY_TYPE'].isin([1,2,6])) | (df['PAVE_REP_METHOD'] != 2)))

The problem is that the first method is easier to read/write/update whereas the second method is a bit harder just due to the grouping of ands/ors especially on longer rules. However the second method runs in about .01 seconds vs the first method taking about 22 seconds to run. This is a problem as we have hundreds of rules that need to be ran.

Is there a way to achieve the same performance as the second method, but using if/else if blocks to filter the data?


Solution

  • Instead of checking for (!condition1 or !condition2) in your second example, you could use (condition1 and condition2) as in the first example in separated lines. E.g.:

    def sjpm201b(self):
        self.df = self.df[self.df['DIR_THROUGH_LANES'].notna()]
        self.df = self.df[self.df['F_SYSTEM'] == 1]
        self.df = self.df[self.df['FACILITY_TYPE'].isin([1,2,6])]
        self.df = self.df[self.df['PAVE_REP_METHOD'] == 2]
    

    Edit: I am slightly confused. Do you want to retain rows where df['F_SYSTEM']==1 or do you only want to keep everything that is != 1? The above function would only retain values that are within your specification and delete the rest.

    Edit2: The following will return a dataframe that will list which rows conform to your specifications but have na-values in the DIR_THROUGH_LANES column.

    def sjpm201b(self):
        df = self.df
        df = df[df['F_SYSTEM'] == 1]
        df = df[df['FACILITY_TYPE'].isin([1,2,6])]
        df = df[df['PAVE_REP_METHOD'] == 2]
        df = df[df['DIR_THROUGH_LANES'].isna()]
        return df
    

    In case you want a simple True/False result of your validation you could use:

    return df['DIR_THROUGH_LANES'].isna().any()
    

    Edit3: Improve readability by adding linebreaks:

    def sjpm201b(self):
        self.df = self.df[
            df['DIR_THROUGH_LANES'].notna()
            | (df['F_SYSTEM'] != 1)
            | (~df['FACILITY_TYPE'].isin([1,2,6]))
            | (df['PAVE_REP_METHOD'] != 2))
        ]