I'm writing an algorithm to match each person from setA with someone from setB, based on interest similarity, using NearestNeighbors(n_neighbors = 1).
This is what I have so far:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

dfA = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, 2, 2], [4, 5, 2, 0], [8, 8, 8, 8]]),
                   columns=['interest0', 'interest2', 'interest3', 'interest4'],
                   index=['personA0', 'personA1', 'personA2', 'personA3'])
dfB = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, 1, 2], [2, 3, 2, 2], [8, 6, 8, 8]]),
                   columns=['interest0', 'interest2', 'interest3', 'interest4'],
                   index=['personB0', 'personB1', 'personB2', 'personB3'])

# my_dist is my custom distance function (its definition is not shown here)
knn = NearestNeighbors(n_neighbors=1, metric=my_dist).fit(dfA)
distances, indices = knn.kneighbors(dfB)
>>> dfA
          interest0  interest2  interest3  interest4
personA0          1          1          1          1
personA1          1          1          2          2
personA2          4          5          2          0
personA3          8          8          8          8
>>> dfB
          interest0  interest2  interest3  interest4
personB0          1          1          1          1
personB1          1          1          1          2
personB2          2          3          2          2
personB3          8          6          8          8
>>> print("Distances\n\n", distances, "\n\nIndices\n\n", indices)
Distances

 [[0.   ]
 [0.125]
 [1.125]
 [0.5  ]]

Indices

 [[0]
 [0]
 [1]
 [3]]
Looking at the output, personB0's top match is personA0 (distance = 0). However, personB1's top match is also personA0 (distance = 0.125)!
I'd like to match personB0 with personA0 first (their distance is the smallest), move that pair to another table, and then re-run the nearest-neighbour search, which should now suggest personA1 as personB1's top match (since A0 has been removed). I've started writing a for loop to do this, but iterating over several arrays and dataframes at once is getting complicated for me. What is the best way to do this? I want a final dataframe like the one below, with a 1:1 correspondence:
SetA      SetB
personA0  personB0
personA1  personB1
personA2  personB3
personA3  personB2
You could use a list to keep track of which people have already been matched. You also need, for each person, a list of neighbours ordered by distance rather than just the single nearest one; to get it, change the value passed to the n_neighbors parameter.
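# Uses dfA, dfB and the imports from the question.
# Ask for every person in dfB as a neighbour, ordered by increasing distance.
knn = NearestNeighbors(n_neighbors=len(dfB)).fit(dfB)
distances, indices = knn.kneighbors(dfA)

matched = []  # positions in dfB that are already taken
pairs = []
for indexA, candidatesB in enumerate(indices):
    personA = dfA.index[indexA]
    # Take the closest candidate from dfB that is still free.
    for indexB in candidatesB:
        if indexB not in matched:
            matched.append(indexB)
            personB = dfB.index[indexB]
            pairs.append([personA, personB])
            break

matches = pd.DataFrame(pairs, columns=['SetA', 'SetB'])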
The resulting dataframe looks like this:
       SetA      SetB
0  personA0  personB0
1  personA1  personB1
2  personA2  personB2
3  personA3  personB3
Please note that I have used the default metric (minkowski with p=2, i.e. Euclidean distance). Results may vary if you pass metric=my_dist to NearestNeighbors.
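The loop above hands out matches in the row order of dfA, so an earlier person can take a candidate that a later person is actually closer to. If you want the behaviour described in the question (pair up the two closest people overall, remove them, repeat), one possible sketch is below. It is an illustration rather than a definitive implementation: it assumes plain Euclidean distance computed with NumPy instead of my_dist, it leaves out the column names for brevity, and greedy_match is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
import pandas as pd

# Same values as in the question (column names omitted for brevity).
dfA = pd.DataFrame(
    [[1, 1, 1, 1], [1, 1, 2, 2], [4, 5, 2, 0], [8, 8, 8, 8]],
    index=['personA0', 'personA1', 'personA2', 'personA3'])
dfB = pd.DataFrame(
    [[1, 1, 1, 1], [1, 1, 1, 2], [2, 3, 2, 2], [8, 6, 8, 8]],
    index=['personB0', 'personB1', 'personB2', 'personB3'])


def greedy_match(dfA, dfB):
    """Hypothetical helper: repeatedly pair the closest unmatched (A, B) couple."""
    a = dfA.to_numpy(dtype=float)
    b = dfB.to_numpy(dtype=float)
    # Pairwise Euclidean distances, shape (len(dfA), len(dfB)).
    # Assumption: swap this line for your own metric if you use my_dist.
    dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

    used_a, used_b, pairs = set(), set(), []
    # Walk through all (A, B) pairs from the smallest distance to the largest.
    for flat in np.argsort(dist, axis=None, kind='stable'):
        i, j = divmod(int(flat), dist.shape[1])
        if i in used_a or j in used_b:
            continue  # one of the two people is already matched
        used_a.add(i)
        used_b.add(j)
        pairs.append([dfA.index[i], dfB.index[j], dist[i, j]])
    return pd.DataFrame(pairs, columns=['SetA', 'SetB', 'distance'])


matches = greedy_match(dfA, dfB)
print(matches)
```

Sorting every pair once gives the same pairs as re-running the neighbour search after each removal, because removing a matched pair never changes the distances between the people who remain. The rows come out in the order the pairs were formed (closest pair first), not in the row order of dfA.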