I am attempting to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some are one to many, others are one to one. There are a few times where it may be reversed and part of the approach is to feed back the miss matches in inverse order to capture those possible matches.
I have three different modules I have created to iterate across each other to complete the work, but I am not getting consistent results. I see possible matches in my data that should be picked up but are not.
There is no clear matching criteria either, so the assumption is if I put the datasets in date order, and look for matching values, I want to take the first match since it should be closer to the same timeframe.
I am using Pandas and Itertools, but maybe not in the ideal format. Any help to get consistent matches would be appreciated.
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results
AID AIssue Date AAMount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
but I usually get
AID AIssue Date AAMount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
with the 5000 not matching. And this is one example, but positive negative does not appear to be the factor when looking at the larger data set.
When reviewing the unmatched results from each, I see at least one $5000 transaction I would expect to be a 1-1 match and it is not in the results.
def matches(iterable):
s = list(iterable)
#Only going to 5 matches to avoid memory overrun on large datasets
s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
return [list(elem) for elem in s]
def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):
#dfL = dataset with larger values
#dfS = dataset with smaller values
#dID = column index of ID record
#dDT = column index of date record
#dVal = column index of dollar value record
S = dfS[dfS.columns[dID]].values.tolist()
S_amount = dfS[dfS.columns[dVal]].values.tolist()
S = matches(S)
S_amount = matches(S_amount)
#get ID of first large record, the ID to be matched in this module
L = dfL[dfL.columns[dID]].iloc[0]
#get Value of first large record, this value will be matching criteria
L_amount = dfL[dfL.columns[dVal]].iloc[0]
count_of_sets = len(S)
for a in range(0,count_of_sets):
list_of_items = S[a]
list_of_values = S_amount[a]
if round(sum(list_of_values),2) == round(L_amount,2):
break
if round(sum(list_of_values),2) == round(L_amount,2):
retVal = list_of_items
else:
retVal = [-1]
return retVal
def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
#dfL = dataset with larger values
#dfS = dataset with smaller values
#dID = column index of ID record
#dDT = column index of date record
#dVal = column index of dollar value record
#returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]
dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
end_row = len(dfLarge.columns[dID]) - 1
matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))
for lg in range(0,end_row):
sm_match_id = one_to_many(dfLarge, dfSmall)
lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
if sm_match_id != [-1]:
end_of_matches = len(sm_match_id)
for sm in range(0, end_of_matches):
if sm == 0:
sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
else:
sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
sm_match['Match'] = lg
lg_match['Match'] = lg
sm_match.set_index('Match', inplace=True)
lg_match.set_index('Match', inplace=True)
matches = lg_match.join(sm_match, how='left')
matches_master = matches_master.append(matches)
dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
return [matches_master, dfLarge, dfSmall]
in my module iterate_one_to_many, I was counting my row length incorrectly. I needed to replace
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)