Pairing control group uniquely to test group using KNN in Python


I want to find unique pairs for a test group, meaning each individual in the control group should be chosen only once. I have Gender, Age, and Education available to match them. I segmented the groups by Gender and Education, since both are binary categories. Within each segment, I want to find the best Age match for each test individual, hence the KNN approach with 1 nearest neighbor. The dummyData I'm using is available here.

The following part is the initialization and the segmentation:

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

TestGroup = pd.read_csv('KNN_DummyData1.csv', names = ['Gender', 'Age', 'Education'])
ControlGroup = pd.read_csv('KNN_DummyData2.csv', names = ['Gender', 'Age', 'Education'])

#### Split TestGroup and ControlGroup into males and females, high and low education
# Filter and keep only Age in one step; this avoids in-place edits on a slice (SettingWithCopyWarning)
Males_highEd = TestGroup.loc[(TestGroup['Gender'] == 1) & (TestGroup['Education'] == 1), ['Age']].reset_index(drop=True)

Males_Ctrl_highEd = ControlGroup.loc[(ControlGroup['Gender'] == 1) & (ControlGroup['Education'] == 1), ['Age']].reset_index(drop=True)

This part is the actual pairing, where I fit on the control group and fill an empty DataFrame with values from the control group. After a control is matched, I attempt to remove it from the original DataFrame (Males_Ctrl_highEd).

Matched_Males_Ctrl_highEd = pd.DataFrame().reindex_like(Males_highEd)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(Males_Ctrl_highEd)

for i in range(len(Males_highEd)):
    distances, indices = nbrs.kneighbors(Males_highEd[i:i+1])
    Matched_Males_Ctrl_highEd.loc[0].iat[i] = Males_Ctrl_highEd.loc[indices[0]]
    print(f"{i} controls of {len(Males_highEd)} tests found")
    Males_Ctrl_highEd = Males_Ctrl_highEd.drop(labels=indices[0], axis=0)

At the moment I am getting the following error for line 6:

ValueError: setting an array element with a sequence.

I have tried various approaches for how to assign a control into the matched control group, but I can't seem to succeed in copying an individual from the original DataFrame into the empty one.

If it is any help, I have a working implementation in MATLAB (but I need it in Python as well):

ControlGroup = Data;
Idx = NaN(length(Data),1);
for i=1:length(Data)
   Idx(i,1) = knnsearch(Data2,Data(i,:),'distance','seuclidean');
   ControlGroup(i,:) = Data2(Idx(i),:);
   Data2(Idx(i),:) = [];
end

If you have any ideas or comments about a different implementation that can do the same, I'm all ears.


Solution

  • I ended up using only Age in the KNN matching (and matching on the binary features manually), with the following solution:

    # Request enough neighbors to cover the most frequent age in the test group
    neededNeighbors = max(TestGroup["Age"].value_counts()) + 1
    nn = NearestNeighbors(n_neighbors=neededNeighbors, algorithm="ball_tree", metric="euclidean").fit(ControlGroup["Age"].to_numpy().reshape(-1, 1))
    TestGroup.sort_values(by="Age", inplace=True)
    distances, indices = nn.kneighbors(TestGroup["Age"].to_numpy().reshape(-1, 1))

    min_age = min(TestGroup["Age"])
    max_age = max(TestGroup["Age"])
    ages = list(range(min_age, max_age + 1))
    # Ranked control indices per age (assumes np.unique yields exactly one row per age, in age order)
    idx = pd.DataFrame(np.unique(indices, axis=0), index=ages)
    # Per-age counter of how many controls have been handed out so far
    cntr = pd.DataFrame(index=ages, columns=["cntrs"])
    cntr["cntrs"] = 0

    matchedControlGroup = pd.DataFrame().reindex_like(TestGroup)
    matchedID = pd.DataFrame(np.full_like(np.arange(len(matchedControlGroup)), np.nan, dtype=np.double))

    for i in range(len(TestGroup)):
        if TestGroup["Age"].loc[i] in cntr.index:
            x = TestGroup["Age"].loc[i]
            # Take the next unused neighbor rank for this age, then advance the counter
            matchedControlGroup.loc[i] = ControlGroup.loc[idx.loc[x, cntr.loc[x, "cntrs"]]]
            cntr.loc[x, "cntrs"] += 1
            matchedID.loc[i] = TestGroup["ID"].loc[i]

    matchedID["ID_Match"] = matchedID

    That way I keep track of how many controls each age group needs and iterate over each age group, giving each individual the next-best match. This means the first individuals in each age group get the better matches and, depending on the number of available controls, there might be an overlap.

    I also did an implementation where this does not happen. However, I could not find a way to avoid refitting the KNN each time a match was found, which made the implementation very slow.
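
One possible way to get unique matches without refitting is to skip the KNN model and mask out used controls directly. The following is a minimal sketch under stated assumptions, not the implementation described above: it matches on Age only, uses made-up ages, and the names `greedy_unique_match`, `available`, and `matched` are hypothetical. It mirrors the MATLAB loop from the question (match, then remove the control) but uses the plain absolute age difference. With a single feature, the standardization done by 'seuclidean' only rescales all distances by the same factor, so the nearest control is the same.

```python
import numpy as np

def greedy_unique_match(test_ages, control_ages):
    """For each test age, return the position of the closest unused control."""
    test_ages = np.asarray(test_ages, dtype=float)
    control_ages = np.asarray(control_ages, dtype=float)
    if len(test_ages) > len(control_ages):
        raise ValueError("need at least as many controls as test individuals")

    available = np.ones(len(control_ages), dtype=bool)  # True = not matched yet
    matched = np.empty(len(test_ages), dtype=int)
    for i, age in enumerate(test_ages):
        dist = np.abs(control_ages - age)
        dist[~available] = np.inf   # used controls can never be the minimum
        j = int(np.argmin(dist))    # closest remaining control
        matched[i] = j
        available[j] = False        # each control is used at most once
    return matched

# Made-up ages for illustration only.
matched = greedy_unique_match([22, 22, 41], [20, 23, 31, 40, 44])
```

Each test individual costs one vectorized pass over the controls, so nothing is refitted. The result still depends on the order of the test individuals, as in the MATLAB loop. If a globally optimal pairing matters more, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the full distance matrix may be worth trying instead.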