Search code examples
pythonpython-3.xpandasdataframe

Trying to merge DataFrames with many conditions


This is a weird one: I have 3 dataframes, "prov_data" which contains a provider id and counts on regions and categories (i.e. how many times that provider interacted with those regions and categories).

prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
                  'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
                   'prov_cat_4':[0,3,0,0]})

enter image description here

"tender_data" which contains the same but for tenders.

tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
                     'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
                  'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
                     'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
                   'ten_cat_4':[0,0,1]})

enter image description here

And finally a "no_match" DF which contains forbidden matches between provider and tender.

no_match = DataFrame({ 'prov_id':[1122,3344,5566], 
            'tender_id':['AA12','BB33','CC45']})

enter image description here

I need to do the following: create a new df that will append the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (i.e. the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.

So that would give me this DF:

df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
                  'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
                   'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
                     'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
                  'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
                     'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
                   'ten_cat_4':[0,0,0]})

Solution

  • code

    # the first columns of each dataframe are the ids
    # i'm going to use them several times
    tid = tender_data.values[:, 0]
    pid = prov_data.values[:, 0]
    # first columns [1, 2, 3, 4] are cat columns
    # we could have used filter, but this is good
    # for this example
    pc = prov_data.values[:, 1:5]
    tc = tender_data.values[:, 1:5]
    # columns [5, 6, 7] are rgn columns
    pr = prov_data.values[:, 5:]
    tr = tender_data.values[:, 5:]
    
    # I want to mave this an m x n array, where
    # m = number of rows in prov df and n = rows in tender
    nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
    nm = nm.reindex_axis(tid, 1).reindex_axis(pid, 0)
    nm = ~nm.fillna(0).astype(bool).values * 1
    
    # the dot products of the cat arrays gets a handy
    # array where there are > 1 co-positive values
    # this combined with the a no_match construct
    a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
    a = a.mask(~a).stack().index
    
    fp = a.get_level_values(0)
    ft = a.get_level_values(1)
    

    pd.concat([
            prov_data.set_index('aprov_id').loc[fp].reset_index(),
            tender_data.set_index('atender_id').loc[ft].reset_index()
        ], axis=1)
    
    
       index  prov_cat_1  prov_cat_2  prov_cat_3  prov_cat_4  prov_region_1  \
    0   1122           0           1           0           0              0   
    1   3344           2           0           0           3              0   
    2   7788           0           0           4           0              0   
    
       prov_region_2  prov_region_3 atender_id  ten_cat_1  ten_cat_2  ten_cat_3  \
    0              2              0       BB33          0          1          1   
    1              0              1       AA12          1          0          0   
    2              0              1       BB33          0          1          1   
    
       ten_cat_4  ten_region_1  ten_region_2  ten_region_3  
    0          0             0             1             1  
    1          0             0             0             1  
    2          0             0             1             1  
    

    explanation

    • use dot products to determine matches
    • many other things I'll try to explain more later