python pandas distance euclidean-distance propensity-score-matching

Calculating smallest within trio distance


I have a pandas dataframe similar to the one below:

Output  var1        var2        var3
1   0.487981    0.297929    0.214090    
1   0.945660    0.031666    0.022674
2   0.119845    0.828661    0.051495
2   0.095186    0.852232    0.052582
3   0.059520    0.053307    0.887173
3   0.091049    0.342226    0.566725
3   0.119295    0.414376    0.466329
... ...         ...         ...
    

I have three columns of propensity score values (var1, var2, var3) and one output column (the treatment). I want to compute the within-trio distance in order to find the trios with the smallest within-trio distance. The procedure comes from the paper "Matching by Propensity Score in Cohort Studies with Three Treatment Groups" by Rassen et al. From their explanation, it looks like computing the perimeter of a triangle, but I am not sure. I think the Java code at this GitHub link does more or less the same thing: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java, but I am not sure how to use it. Since I use Python, I have two options: adapt that code or write something else. My idea is that var1, var2 and var3 can be treated as spatial x, y, z coordinates, so that each row is a point in space. I found a function that calculates the distance between 2 points:

#found here https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import itertools

import numpy as np

distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
  pairs = itertools.combinations(cloud, 2)
  # use the built-in min: np.min cannot consume a map/generator in Python 3
  return min(distance(*pair) for pair in pairs)

def get_points(filename):
  with open(filename, 'r') as file:
    rows = np.genfromtxt(file, delimiter=',', skip_header=True)
  return rows


filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)

However, I want to calculate the distance among 3 points, so I think I need to iterate over all possible combinations of 3 points, considering the pairs XY, XZ and YZ within each one, but I am not sure this procedure is correct.
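One way to read the "perimeter of a triangle" idea, as a minimal sketch: treat each row's (var1, var2, var3) as a point and sum the three pairwise Euclidean distances. The function name `trio_distance` is hypothetical, and it is an assumption that this sum matches the paper's exact definition of within-trio distance.

```python
import numpy as np

def trio_distance(p1, p2, p3):
    """Sum of the three pairwise Euclidean distances (triangle perimeter)."""
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    return (np.linalg.norm(p1 - p2)
            + np.linalg.norm(p1 - p3)
            + np.linalg.norm(p2 - p3))

# One row from each treatment group in the sample data above
a = [0.487981, 0.297929, 0.214090]  # Output 1
b = [0.119845, 0.828661, 0.051495]  # Output 2
c = [0.059520, 0.053307, 0.887173]  # Output 3
print(trio_distance(a, b, c))
```

With this reading, a trio whose three points coincide has distance 0, and the distance grows as any of the three points moves away from the others.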


Solution

  • In the end I tried my own solution, which I think is correct but may be too computationally expensive. I created three datasets according to the Output value: dataset1 = dataset[dataset["Output"]==1], and likewise for Output=2 and Output=3. This is my distance function:

    def Euclidean_Dist(df1, df2):
        return np.linalg.norm(df1 - df2)
    

    My variables:

    tripletta_for = []
    tripletta_tot_wr = []
    
    p_inf = float('inf')
    
    counter = 1
    

    These are the steps used to compute the within-trio distance. I hope they are correct.

    '''
    Each item yielded by iterrows() is a tuple (index, row):
    i[0] = index
    i[1] = row (treatment, prop)
    i[1][0] = treatment
    i[1][1] = prop
    '''
    # I want to compute the distance between i[1][1], j[1][1] and k[1][1]
    
    for i in dataset1.iterrows():
        # stop once either pool of unmatched candidates is exhausted;
        # otherwise a stale tripletta_for would be dropped again (KeyError)
        if dataset2.empty or dataset3.empty:
            break
        minimum_distance = p_inf
        print(counter)  # progress indicator
        counter = counter + 1
        for j in dataset2.iterrows():
            dist12 = Euclidean_Dist(i[1][1], j[1][1])
            for k in dataset3.iterrows():
                dist13 = Euclidean_Dist(i[1][1], k[1][1])
                dist23 = Euclidean_Dist(j[1][1], k[1][1])
                # within-trio distance = sum of the three pairwise distances
                somma = dist12 + dist13 + dist23
                if somma < minimum_distance:
                    minimum_distance = somma
                    tripletta_for = i[0], j[0], k[0]
        # remove the matched rows so they cannot be reused in later trios
        dataset2.drop(index=tripletta_for[1], inplace=True)
        dataset3.drop(index=tripletta_for[2], inplace=True)
        tripletta_tot_wr.append(tripletta_for)
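Since the nested `iterrows()` loops may be slow, here is a hedged sketch of the same greedy procedure using NumPy broadcasting. It assumes the column layout from the question (`Output`, `var1`, `var2`, `var3`) rather than the single `prop` column used above, and `greedy_trio_match` is a hypothetical name. For each row of group 1 it builds the full matrix of trio distances against the still-unmatched rows of groups 2 and 3 and picks the minimum.

```python
import numpy as np
import pandas as pd

def greedy_trio_match(df, group_col="Output", score_cols=("var1", "var2", "var3")):
    """Greedy matching without replacement; returns (idx1, idx2, idx3, distance) tuples."""
    cols = list(score_cols)
    g1, g2, g3 = (df[df[group_col] == g] for g in (1, 2, 3))
    a, b, c = (g[cols].to_numpy(dtype=float) for g in (g1, g2, g3))
    # d23[j, k] = distance between row j of group 2 and row k of group 3
    d23 = np.linalg.norm(b[:, None, :] - c[None, :, :], axis=2)
    free2 = np.ones(len(b), dtype=bool)  # True while a group-2 row is unmatched
    free3 = np.ones(len(c), dtype=bool)  # True while a group-3 row is unmatched
    trios = []
    for i in range(len(a)):
        if not free2.any() or not free3.any():
            break  # no candidates left in one of the groups
        d12 = np.linalg.norm(b - a[i], axis=1)  # shape (n2,)
        d13 = np.linalg.norm(c - a[i], axis=1)  # shape (n3,)
        total = d12[:, None] + d13[None, :] + d23  # shape (n2, n3)
        total[~free2, :] = np.inf  # exclude already matched group-2 rows
        total[:, ~free3] = np.inf  # exclude already matched group-3 rows
        j, k = np.unravel_index(np.argmin(total), total.shape)
        free2[j] = False
        free3[k] = False
        trios.append((g1.index[i], g2.index[j], g3.index[k], float(total[j, k])))
    return trios

# Usage with the sample rows from the question
df = pd.DataFrame({
    "Output": [1, 1, 2, 2, 3, 3, 3],
    "var1": [0.487981, 0.945660, 0.119845, 0.095186, 0.059520, 0.091049, 0.119295],
    "var2": [0.297929, 0.031666, 0.828661, 0.852232, 0.053307, 0.342226, 0.414376],
    "var3": [0.214090, 0.022674, 0.051495, 0.052582, 0.887173, 0.566725, 0.466329],
})
for trio in greedy_trio_match(df):
    print(trio)
```

Like the loops above, this is greedy: the result depends on the order of the rows in group 1 and is not guaranteed to minimise the total distance over all trios. The `d23` and `total` matrices each hold n2 × n3 values, so memory may become the limit for large groups.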