Search code examples
pandasdataframedynamicpandasql

Dynamic Query for Dataframe with multiple conditions and types


CONTEXT

I have a DataFrame I'm trying to filter using multiple different conditions / "rulesets". There are 3 different rules with a set of conditions as follows:

Rule_DF    Variable    Operator    Value
Rule1      Gender      ==          M
Rule1      DOB         >           01/01/2000
Rule1      Height      >           72    <-- "inches"
Rule2      Gender      ==          F
Rule2      DOB         >=          06/10/1998
Rule2      Height      <           70
etc...

EXAMPLE DATA

Name    Gender    DOB          Height(in)
Bob     M         01/01/2001   74
Jane    F         01/01/1999   64
etc...

EXPECTED RESULT

rule1_df should contain a list of all the names of people that fit rule 1's conditions. rule2_df should contain a list of all the names of people that fit rule 2's conditions.

CURRENT CODE

I am at the point where the rules are all grouped by the rule name and the query is written as such:

for item in grouped_rule_df:
    query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
    query = "Gender == 'M' & Height > 72 & DOB > '01/01/2000'"
    test = df.query(query)

PROBLEM

The query still returns a list of all the heights. If I run the query with just the height condition, it returns the correct list of people. Can you not combine data types in pandas .query? Or is PandaSQL/another package a better solution here?


Solution

  • Assuming rule_df the rules DataFrame and df the data one:

    # first ensure the correct column names are used
    # and that dates are in a YYYY-MM-DD format
    tmp = (df.rename(columns={'Height(in)': 'Height'})
             .assign(DOB=lambda d: pd.to_datetime(d['DOB']).dt.strftime('%Y-%m-%d'))
          )
    
    for rule, g in rule_df.groupby('Rule_DF'):
        # convert dates to YYYY-MM-DD
        # keep numbers as numeric, quote strings
        date = (pd.to_datetime(g['Value'], errors='coerce')
                  .dt.strftime('%Y-%m-%d')
                )
        value = (pd.to_numeric(g['Value'], errors='coerce')
                   .fillna('"'+date.fillna(g['Value'])+'"')
                   .astype(str)
                )
    
        # craft query string and select
        query = ' & '.join('('+g['Variable']+g['Operator']+value+')')
    
        out[rule] = tmp.query(query)
    

    Output:

    {'Rule1':    Name Gender         DOB  Height
              0   Bob      M  2001-01-01      74,
     'Rule2':    Name Gender         DOB  Height
              1  Jane      F  1999-01-01      64,
    }
    

    Queries:

    # Rule1
    (Gender=="M") & (DOB>"2000-01-01") & (Height>72.0)
    
    # Rule 2
    (Gender=="F") & (DOB>="1998-06-10") & (Height<70.0)