python pandas machine-learning similarity

Return index after calculating distance metric


Given a DataFrame with 4 feature columns and 1 index column:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df['index'] = range(1, len(df) + 1)

I want to calculate the Manhattan distance between each row and a point supplied by the user. The user's inputs are represented by a, b, c, d. The function is defined below.

def Manhattan_d(a,b,c,d):
    return (a - df['A']) + (b - df['B']) + (c - df['C']) + (d - df['D'])

When the answer is returned to me, it comes out as a list. Now I want to find the minimum value AND link it back to the index number it came from.

If I do return(min(formula)), I get a single number that I can't trace back to the index it came from. If it helps, the index represents a category, so I need to find the category with the minimum output after the formula is applied.

Hope that's clear.


Solution

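  • A better approach is to compute the Manhattan distance for each row of the dataframe and then call .idxmin() on the result. That returns the index label of the row in the original dataframe that is most similar to (has the lowest Manhattan distance from) the point a,b,c,d you passed to the function. Keep in mind that .idxmin() returns the dataframe's own row label (0-based by default), so if you need the value from your 'index' column, look it up with df.loc[label, 'index'].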

    def Manhattan_d(a, b, c, d, df):
        # Sum of absolute differences per row; idxmin() returns the label of the closest row
        return df.apply(lambda row: abs(row['A']-a) + abs(row['B']-b) + abs(row['C']-c) + abs(row['D']-d), axis=1).idxmin()


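    Note: Manhattan distance is the sum of the absolute values of the differences, so I have added abs() around each term. Without it, positive and negative differences cancel each other out.

    Another note: it is generally good practice to pass everything a function depends on as an argument instead of relying on a global, which is why I added df as an input to your function.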

    Another possibility is to use an existing implementation, such as the DistanceMetric class from scikit-learn.
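
For larger dataframes, the row-wise apply can usually be replaced by a vectorized expression. The following is a minimal sketch, not part of the original answer. The function name nearest_category and the small demo frame are made up for illustration, and it assumes the feature columns are named A to D and the category lives in a column called 'index', as in the question.

```python
import numpy as np
import pandas as pd

def nearest_category(a, b, c, d, df):
    """Return the 'index' (category) value of the row closest to (a, b, c, d)."""
    point = np.array([a, b, c, d])
    # Vectorized Manhattan distance: sum of absolute differences per row
    distances = (df[['A', 'B', 'C', 'D']] - point).abs().sum(axis=1)
    # idxmin() gives the row label; use it to read the category column
    return df.loc[distances.idxmin(), 'index']

# Small hand-made frame (hypothetical data) so the result can be checked by eye
demo = pd.DataFrame({'A': [1, 10, 50], 'B': [2, 20, 50],
                     'C': [3, 30, 50], 'D': [4, 40, 50]})
demo['index'] = range(1, len(demo) + 1)

result = nearest_category(9, 21, 29, 41, demo)
print(result)
```

Here the second row is at distance 4 from the query point, while the others are at 90 and 100, so the function returns that row's category, 2. Returning the value from the 'index' column rather than the raw row label avoids an off-by-one mix-up when the category numbering starts at 1.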