CONTEXT
I have a DataFrame I'm trying to filter using multiple different conditions / "rulesets". There are 3 different rules with a set of conditions as follows:
Rule_DF Variable Operator Value
Rule1 Gender == M
Rule1 DOB > 01/01/2000
Rule1 Height > 72 <-- "inches"
Rule2 Gender == F
Rule2 DOB >= 06/10/1998
Rule2 Height < 70
etc...
EXAMPLE DATA
Name Gender DOB Height(in)
Bob M 01/01/2001 74
Jane F 01/01/1999 64
etc...
EXPECTED RESULT
rule1_df should contain a list of all the names of people that fit rule 1's conditions. rule2_df should contain a list of all the names of people that fit rule 2's conditions.
CURRENT CODE
I am at the point where the rules are all grouped by the rule name and the query is written as such:
for item in grouped_rule_df:
query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
query = "Gender == 'M' & Height > 72 & DOB > '01/01/2000'"
test = df.query(query)
PROBLEM
The query still returns a list of all the heights. If I run the query with just the height condition, it returns the correct list of people. Can you not combine data types in pandas .query? Or is PandaSQL/another package a better solution here?
Assuming rule_df
the rules DataFrame and df
the data one:
# first ensure the correct column names are used
# and that dates are in a YYYY-MM-DD format
tmp = (df.rename(columns={'Height(in)': 'Height'})
.assign(DOB=lambda d: pd.to_datetime(d['DOB']).dt.strftime('%Y-%m-%d'))
)
for rule, g in rule_df.groupby('Rule_DF'):
# convert dates to YYYY-MM-DD
# keep numbers as numeric, quote strings
date = (pd.to_datetime(g['Value'], errors='coerce')
.dt.strftime('%Y-%m-%d')
)
value = (pd.to_numeric(g['Value'], errors='coerce')
.fillna('"'+date.fillna(g['Value'])+'"')
.astype(str)
)
# craft query string and select
query = ' & '.join('('+g['Variable']+g['Operator']+value+')')
out[rule] = tmp.query(query)
Output:
{'Rule1': Name Gender DOB Height
0 Bob M 2001-01-01 74,
'Rule2': Name Gender DOB Height
1 Jane F 1999-01-01 64,
}
Queries:
# Rule1
(Gender=="M") & (DOB>"2000-01-01") & (Height>72.0)
# Rule 2
(Gender=="F") & (DOB>="1998-06-10") & (Height<70.0)