I am trying to perform a daily reconciliation which will check a master dataset in a pandas dataframe, for the existence various combinations in another dataset (the reconciliation key), which contains wildcards.
Here is an illustrative representation of the issue (not using real information from my rec):
Master Dataset
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | Norris | Firefighter | Female |
Angela | Thompson | Teacher | Female |
Ben | Johnson | Police Officer | Male |
Ben | Peterson | Solicitor | Male |
Charlie | Davies | Nurse | Male |
Debbie | Smith | Lawyer | Female |
Reconciliation Key
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | * | * | * |
Ben | Johnson | * | Male |
Debbie | * | * | Female |
So what I am trying to do here is find in the master dataset:
I cannot work out how to do this task... I would normally create a unique key with the combinations of FirstName, LastName, Occupation and Gender, and then merge the two datasets, but with wildcards, this isn't working for me.
I also tried a cartesian product (which I could later filter), but the size of the dataset I am working with resulted in a memory issue, so that wasn't working either.
Ideally, the output of the check would give this table (i.e. showing rows in the master dataset that meet the criteria in the reconciliation key):
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | Norris | Firefighter | Female |
Angela | Thompson | Teacher | Female |
Ben | Johnson | Police Officer | Male |
Debbie | Smith | Lawyer | Female |
It seems as though I was able to resolve this by defining a function that checked whether the row contained LastName = Wildcard or LastName = LastName, and Occupation = Wildcard or Occupation = Occupation.
This is my code:
Setup:
import numpy as np
import pandas as pd
master_df = pd.DataFrame({
'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})
recon_df = pd.DataFrame({
'FirstName': ['Angela', 'Ben', 'Debbie'],
'LastName': ['*', 'J', '*'],
'Occupation': ['*', '*', '*'],
'Gender': ['*', 'Male', 'Female'],
})
Solution:
def matches_wildcard(row):
recon_row = recon_df.loc[(recon_df['FirstName'] == row['FirstName']) &
((recon_df['LastName'] == '*') | (recon_df['LastName'] == row['LastName'])) &
((recon_df['Occupation'] == '*') | (recon_df['Occupation'] == row['Occupation'])) &
((recon_df['Gender'] == '*') | (recon_df['Gender'] == row['Gender']))]
return not recon_row.empty
matched_df = master_df[master_df.apply(matches_wildcard, axis=1)]
print(matched_df)
Output:
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | N | Firefighter | Female |
Ben | J | Police | Male |
Debbie | S | Lawyer | Female |