This is a weird one: I have 3 dataframes, "prov_data" which contains a provider id and counts on regions and categories (i.e. how many times that provider interacted with those regions and categories).
prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
'prov_cat_4':[0,3,0,0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
'ten_cat_4':[0,0,1]})
And finally a "no_match" DF which contains forbidden matches between provider and tender.
no_match = DataFrame({ 'prov_id':[1122,3344,5566],
'tender_id':['AA12','BB33','CC45']})
I need to do the following: create a new df that will append the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (i.e. the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
'ten_cat_4':[0,0,0]})
code
# the first columns of each dataframe are the ids
# i'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# first columns [1, 2, 3, 4] are cat columns
# we could have used filter, but this is good
# for this example
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are rgn columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to mave this an m x n array, where
# m = number of rows in prov df and n = rows in tender
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
nm = nm.reindex_axis(tid, 1).reindex_axis(pid, 0)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot products of the cat arrays gets a handy
# array where there are > 1 co-positive values
# this combined with the a no_match construct
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
prov_data.set_index('aprov_id').loc[fp].reset_index(),
tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation