There are 2 Dataframes and one hierarchical index (pandas multiIndex). Dataframe A has a list of ID's and Names. Dataframe B has a list of Name combinations and a similarity score.
I want, based on the multiindex to pick up the values from Dataframe A and check if the combination does exist in DataFrame B. If yes, I want to bring to my Multindex dataframe the similiarity score, otherwise just 0.
DataFrame A (Original Dataframe)
test= pd.DataFrame({'row':['a','b','c','d'],'col_A' : ["Alexis","Alexi","Peter","Pete"]})
test = test.set_index('row');test
Out:
row col_A
a Alexis
b Alexi
c Peter
d Pete
DataFrame B (Name Similarity)
names = pd.DataFrame({'A' : ["Alexis","Alexi","Peter","Pete"]
,'B' : ["Alexi","Alexis","Pete","Peter"]
, "similarity" : [0.9,0.9,0.8,0.8]})
Out:
A B similarity
0 Alexis Alexi 0.9
1 Alexi Alexis 0.9
2 Peter Pete 0.8
3 Pete Peter 0.8
Multiindex
# Creating a Pandas MultiIndex
arrays = [['a', 'a', 'a', 'b', 'b', 'c'],
['b', 'c', 'd', 'c', 'd', 'd']]
tuples = list(zip(*arrays))
indexy = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
to iterate through the index I use the below function, however, I am not sure how to adjust it, in order to bring the similarity score when exists or 0 when it does not.
a = pd.DataFrame((test.reindex(indexy.get_level_values(0)).values (?) test.reindex(indexy.get_level_values(1))).values,index=indexy,columns=test.columns)
Where a I want to look like this:
row similarity
first second
a b 0.9
c 0
d 0
b c 0
d 0
c d 0.8
So if you are not too much into the multiindex, here is one way to get your data as you expect:
import pandas as pd
test= pd.DataFrame({'row':['a','b','c','d'],'col_A' : ["Alexis","Alexi","Peter","Pete"]})
names = pd.DataFrame({'A' : ["Alexis","Alexi","Peter","Pete"],
'B' : ["Alexi","Alexis","Pete","Peter"],
"similarity" : [0.9,0.9,0.8,0.8]})
Note I don't set_index
test
but you can do it, it would slightly change the following (see the comments). You can create a
dataframe a such as:
import itertools
a = pd.DataFrame([p for p in itertools.combinations(test['col_A'], 2)],columns =['A','B'],
index=['%s,%s'%p for p in itertools.combinations(test['row'], r=2)])
# here if you did set_index your test, then replace
# index=['%s,%s'%p for p in itertools.combinations(test['row'], r=2)] by
# index=['%s,%s'%p for p in itertools.combinations(test.index, r=2)]
And it's look like:
A B
a,b Alexis Alexi
a,c Alexis Peter
a,d Alexis Pete
b,c Alexi Peter
b,d Alexi Pete
c,d Peter Pete
Then you can use reset_index
(to get current index as a column but it depends of what you want exactly) merge
with names
on columns A and B, fill nan
with 0, drop the two columns A and B, and rename (if necessary):
a = a.reset_index().merge(names, how = 'left', on = ['A','B']).fillna(0).\
drop(labels = ['A','B'], axis=1).rename(columns = {'index':'row', 'similarity':'col_A'})
Let me know if you can do what you want after
EDIT: with the new output you look for, you can do:
a = pd.DataFrame([p for p in itertools.combinations(test['col_A'], 2)],columns =['A','B'],
index=pd.MultiIndex.from_tuples([p for p in itertools.combinations(test.index, r=2)], names=['first', 'second']))
Note: itertools
generate tuples that are used in pd.MultiIndex.from_tuples
to define your multiindex DF.
Now you can merge
(to keep the multiindex, you need to reset_index
before and set_index
after:
a = a.reset_index().merge(names, how = 'left', on = ['A','B']).fillna(0).\
drop(labels = ['A','B'], axis=1).set_index(['first', 'second'])