
Remove duplicates with additional requirements


I have three columns (x, y, m), where x and y are coordinates and m is a measurement. Some rows are duplicates, defined as rows that share the same (x, y). Within each set of duplicates, I rank the rows by the measurement m and keep only the one with the minimum m. Here is an example:

x = np.array([1,1,2,2,1,1,2])
y = np.array([1,2,1,2,1,1,1])
m = np.array([10,2,13,4,6,15,7])

There are three duplicates with the same coordinates (1,1); among the three, the minimum m is 6. There are two duplicates with the same coordinates (2,1); among the two, the minimum m is 7. So the final result I want is:

x = np.array([1,2,1,2])
y = np.array([2,2,1,1])
m = np.array([2,4,6,7])

numpy.unique cannot handle this situation on its own. Any great thoughts?


Solution

  • We could use pandas here for a cleaner solution -

    import pandas as pd
    
    In [43]: df = pd.DataFrame({'x':x,'y':y,'m':m})
    
    # idxmin returns index labels, so select rows with the label-based .loc
    In [46]: out_df = df.loc[df.groupby(['x','y'])['m'].idxmin()]
    
    # Format #1 : Final output as a 2D array
    In [47]: out_df.values
    Out[47]: 
    array([[1, 1, 6],
           [1, 2, 2],
           [2, 1, 7],
           [2, 2, 4]])
    
    # Format #2 : Final output as three separate 1D arrays
    In [50]: X,Y,M = out_df.values.T
    
    In [51]: X
    Out[51]: array([1, 1, 2, 2])
    
    In [52]: Y
    Out[52]: array([1, 2, 1, 2])
    
    In [53]: M
    Out[53]: array([6, 2, 7, 4])
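
  • If pulling in pandas is not desirable, the same grouping idea can probably be done in plain NumPy. The following is only a sketch, not part of the original answer: it sorts the rows by (x, y, m) with np.lexsort and keeps the first row of each (x, y) group, which is then the one with the minimum m. The names order, first, xs, ys and ms are made up for illustration.

```python
import numpy as np

x = np.array([1, 1, 2, 2, 1, 1, 2])
y = np.array([1, 2, 1, 2, 1, 1, 1])
m = np.array([10, 2, 13, 4, 6, 15, 7])

# np.lexsort treats the LAST key as the primary one,
# so this sorts by x, then y, then m.
order = np.lexsort((m, y, x))
xs, ys, ms = x[order], y[order], m[order]

# A row starts a new (x, y) group if it differs from the previous row
# in x or y. The first row of each group has the smallest m.
first = np.ones(len(xs), dtype=bool)
first[1:] = (xs[1:] != xs[:-1]) | (ys[1:] != ys[:-1])

X, Y, M = xs[first], ys[first], ms[first]
```

    As with the pandas version, the result comes out sorted by (x, y) rather than in the order shown in the question.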