Tags: python, algorithm, computational-geometry, dimensionality-reduction, multi-dimensional-scaling

Choosing subset of farthest points in given set of points


Imagine you are given a set S of n points in 3 dimensions. The distance between any two points is the ordinary Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other. In other words, there is no other subset Q' of k points whose minimum pairwise distance is greater than that of Q.

If n is approximately 16 million and k is about 300, how do we efficiently do this?

My guess is that this is NP-hard, so maybe we should just focus on approximation. One idea I can think of is using multidimensional scaling to project these points onto a line and then using a version of binary search to get points that are farthest apart on that line.
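For concreteness, here is a rough sketch of that projection idea on a small random set. It uses the first PCA axis (via NumPy's SVD) as a cheap stand-in for full MDS, which would be far too expensive at n ≈ 16 million, and picks evenly spaced points along the line; both choices are purely illustrative and carry no maxmin guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 3))  # small stand-in for the 16-million-point set
k = 10

# Project onto the first principal axis (PCA as a cheap stand-in for MDS).
centered = points - points.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
line = centered @ vt[0]  # 1-D coordinate of each point along the axis

# Pick k points spread evenly along the line by rank.
order = np.argsort(line)
picks = np.linspace(0, len(order) - 1, k).astype(int)
chosen = points[order[picks]]
print(chosen.shape)  # (10, 3)
```

Spreading points evenly in the 1-D embedding says nothing about their pairwise distances in 3-D, which is why the answer below takes a different route.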


Solution

  • This is known as the discrete p-dispersion (maxmin) problem.

    The optimality bound is proved in White (1991), and Ravi et al. (1994) give a factor-2 approximation for the problem, with the latter proving that this heuristic is the best possible (unless P=NP).

    Factor-2 Approximation

    The factor-2 approximation is as follows:

    Let V be the set of nodes/objects.
    Let i and j be two nodes at maximum distance.
    Let p be the number of objects to choose.
    P = set([i,j])
    while size(P)<p:
      Find a node v in V-P such that min_{v' in P} dist(v,v') is maximum.
      (That is: find the node with the greatest minimum distance to the set P.)
      P = P.union({v})
    Output P
    

    You could implement this in Python like so:

    #!/usr/bin/env python3
    
    import numpy as np
    
    p = 50
    N = 400
    
    print("Building distance matrix...")
    d = np.random.rand(N,N) #Random matrix
    d = (d + d.T)/2         #Make the matrix symmetric
    
    print("Finding initial edge...")
    maxdist  = 0
    bestpair = ()
    for i in range(N):
      for j in range(i+1,N):
        if d[i,j]>maxdist:
          maxdist = d[i,j]
          bestpair = (i,j)
    
    P = set()
    P.add(bestpair[0])
    P.add(bestpair[1])
    
    print("Finding optimal set...")
    while len(P)<p:
      print("P size = {0}".format(len(P)))
      maxdist = 0
      vbest   = None
      for v in range(N):
        if v in P:
          continue
        # Distance from v to its nearest member of P -- the quantity
        # the maxmin heuristic maximizes
        mindist = min(d[v,vprime] for vprime in P)
        if mindist>maxdist:
          maxdist = mindist
          vbest   = v
      P.add(vbest)
    
    print(P)
    

    Exact Solution

    You could also model this as an MIP. For p=50, n=400, the optimality gap was still 568% after 6000 s, whereas the approximation algorithm above took 0.47 s to reach an optimality gap of 100% (or less). (Note that this formulation maximizes the sum of pairwise distances between the chosen points, not the minimum, so it is a related maxisum dispersion model rather than maxmin itself.) A naive Gurobi Python representation might look like this:

    #!/usr/bin/env python
    import numpy as np
    import gurobipy as grb
    
    p = 50
    N = 400
    
    print("Building distance matrix...")
    d = np.random.rand(N,N) #Random matrix
    d = (d + d.T)/2             #Make the matrix symmetric
    
    m = grb.Model(name="MIP Model")
    
    used  = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]
    
    objective = grb.quicksum( d[i,j]*used[i]*used[j] for i in range(0,N) for j in range(i+1,N) )
    
    m.addConstr(
      lhs=grb.quicksum(used),
      sense=grb.GRB.EQUAL,
      rhs=p
    )
    
    # for maximization
    m.ModelSense = grb.GRB.MAXIMIZE
    m.setObjective(objective)
    
    # m.Params.TimeLimit = 3*60
    
    # solve with Gurobi
    m.optimize()
    

    Scaling

    Obviously, the O(N^2) scan for the initial pair is bad. We can find that pair more efficiently by recognizing that it must lie on the convex hull of the dataset, which can be computed in O(N log N). Once we've found it, we proceed as before (using SciPy for acceleration).

    The best way to scale further would be to use an R*-tree to efficiently find the minimum distance between a candidate point and the growing set P. But this is hard to do efficiently in Python, since an explicit for loop over P would still be involved.

    import numpy as np
    from scipy.spatial import ConvexHull
    from scipy.spatial.distance import cdist
    
    p = 300
    N = 16000000
    
    # Find a convex hull in O(N log N)
    points = np.random.rand(N, 3)   # N random points in 3-D
    
    # Returned 420 points in testing
    hull = ConvexHull(points)
    
    # Extract the points forming the hull
    hullpoints = points[hull.vertices,:]
    
    # Naive way of finding the best pair in O(H^2) time if H is number of points on
    # hull
    hdist = cdist(hullpoints, hullpoints, metric='euclidean')
    
    # Get the farthest apart points
    bestpair = np.unravel_index(hdist.argmax(), hdist.shape)
    
    P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])
    
    # Greedily grow P: each step adds the point whose minimum distance to P is largest
    print("Finding optimal set...")
    while len(P)<p:
      print("P size = {0}".format(len(P)))
      distance_to_P        = cdist(points, P)
      minimum_to_each_of_P = np.min(distance_to_P, axis=1)
      best_new_point_idx   = np.argmax(minimum_to_each_of_P)
      best_new_point = np.expand_dims(points[best_new_point_idx,:],0)
      P = np.append(P,best_new_point,axis=0)
    
    print(P)
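If rebuilding a small spatial index each iteration is acceptable, SciPy's cKDTree (SciPy is already a dependency above) gives the same vectorized minimum-distance query while avoiding cdist's N-by-|P| distance matrix. A sketch of that variant, with N shrunk and the hull-based seeding replaced by a naive scan purely so it runs quickly:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

p = 20
N = 2000

np.random.seed(0)
points = np.random.rand(N, 3)

# Seed with the two points farthest apart (naive O(N^2) here; use the
# convex-hull trick above for large N).
d = cdist(points, points)
i, j = np.unravel_index(d.argmax(), d.shape)
P = np.array([points[i], points[j]])

while len(P) < p:
    # One vectorized query gives every point's distance to its nearest
    # member of P; points already in P get distance 0 and are never picked.
    tree = cKDTree(P)
    min_dists, _ = tree.query(points)
    P = np.vstack([P, points[np.argmax(min_dists)]])

print(P.shape)  # (20, 3)
```

Rebuilding the tree is cheap since |P| never exceeds p, and each `query` call is a single C-level pass over all N points rather than a Python loop.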