Search code examples
pandasindexinghierarchical

Vlook Up Elements from another dataframe for creating MultiIndex DataFrame in Python


There are 2 Dataframes and one hierarchical index (pandas multiIndex). Dataframe A has a list of ID's and Names. Dataframe B has a list of Name combinations and a similarity score.

I want, based on the multiindex to pick up the values from Dataframe A and check if the combination does exist in DataFrame B. If yes, I want to bring to my Multindex dataframe the similiarity score, otherwise just 0.

DataFrame A (Original Dataframe)

test= pd.DataFrame({'row':['a','b','c','d'],'col_A' : ["Alexis","Alexi","Peter","Pete"]})

test = test.set_index('row');test
Out:   
    row   col_A   
    a     Alexis
    b     Alexi
    c     Peter
    d     Pete

DataFrame B (Name Similarity)

names = pd.DataFrame({'A' : ["Alexis","Alexi","Peter","Pete"]
                    ,'B' : ["Alexi","Alexis","Pete","Peter"]
                    , "similarity" : [0.9,0.9,0.8,0.8]})
Out:
       A       B      similarity
0   Alexis   Alexi         0.9
1   Alexi    Alexis        0.9
2   Peter    Pete          0.8
3   Pete     Peter         0.8

Multiindex

# Creating a Pandas MultiIndex 
arrays = [['a', 'a', 'a', 'b', 'b', 'c'],
          ['b', 'c', 'd', 'c', 'd', 'd']]

tuples = list(zip(*arrays))
indexy = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

to iterate through the index I use the below function, however, I am not sure how to adjust it, in order to bring the similarity score when exists or 0 when it does not.

a = pd.DataFrame((test.reindex(indexy.get_level_values(0)).values (?) test.reindex(indexy.get_level_values(1))).values,index=indexy,columns=test.columns)

Where a I want to look like this:

   row      similarity
first second           
a     b             0.9
      c             0
      d             0
b     c             0
      d             0
c     d             0.8

Solution

  • So if you are not too much into the multiindex, here is one way to get your data as you expect:

    import pandas as pd
    test= pd.DataFrame({'row':['a','b','c','d'],'col_A' : ["Alexis","Alexi","Peter","Pete"]})
    names = pd.DataFrame({'A' : ["Alexis","Alexi","Peter","Pete"],
                        'B' : ["Alexi","Alexis","Pete","Peter"],
                        "similarity" : [0.9,0.9,0.8,0.8]})
    

    Note I don't set_index test but you can do it, it would slightly change the following (see the comments). You can create a dataframe a such as:

    import itertools
    a = pd.DataFrame([p for p in itertools.combinations(test['col_A'], 2)],columns =['A','B'],
                     index=['%s,%s'%p for p in itertools.combinations(test['row'], r=2)])
    # here if you did set_index your test, then replace 
    # index=['%s,%s'%p for p in itertools.combinations(test['row'], r=2)] by 
    # index=['%s,%s'%p for p in itertools.combinations(test.index, r=2)]
    

    And it's look like:

              A      B
    a,b  Alexis  Alexi
    a,c  Alexis  Peter
    a,d  Alexis   Pete
    b,c   Alexi  Peter
    b,d   Alexi   Pete
    c,d   Peter   Pete
    

    Then you can use reset_index (to get current index as a column but it depends of what you want exactly) merge with names on columns A and B, fill nan with 0, drop the two columns A and B, and rename (if necessary):

    a = a.reset_index().merge(names, how = 'left', on = ['A','B']).fillna(0).\
         drop(labels = ['A','B'], axis=1).rename(columns = {'index':'row', 'similarity':'col_A'})
    

    Let me know if you can do what you want after

    EDIT: with the new output you look for, you can do:

    a = pd.DataFrame([p for p in itertools.combinations(test['col_A'], 2)],columns =['A','B'],
                     index=pd.MultiIndex.from_tuples([p for p in itertools.combinations(test.index, r=2)], names=['first', 'second']))
    

    Note: itertools generate tuples that are used in pd.MultiIndex.from_tuples to define your multiindex DF.

    Now you can merge (to keep the multiindex, you need to reset_index before and set_index after:

    a = a.reset_index().merge(names, how = 'left', on = ['A','B']).fillna(0).\
             drop(labels = ['A','B'], axis=1).set_index(['first', 'second'])