Search code examples
pythonarraysnumpyselectionsampling

How to select numeric samples based on their distance relative to samples already selected (Python)


I have some random test data in a 2D array of shape (500,2) as such:

xy = np.random.randint(low=0.1, high=1000, size=[500, 2])

From this array, I first select 10 random samples, to select the 11th sample, I would like to pick the sample that is the furthest away from the original 10 selected samples collectively, I am using the euclidean distance to do this. I need to keep doing this until a certain amount have been picked. Here is my attempt at doing this.

# Function to get the distance between samples
def get_dist(a, b):

    return np.sqrt(np.sum(np.square(a - b)))


# Set up variables and empty lists for the selected sample and starting samples
n_xy_to_select = 120
selected_xy = []
starting = []


# This selects 10 random samples and appends them to selected_xy
for i in range(10):
    idx = np.random.randint(len(xy))
    starting_10 = xy[idx, :]
    selected_xy.append(starting_10)
    starting.append(starting_10)
    xy = np.delete(xy, idx, axis = 0)
starting = np.asarray(starting)


# This performs the selection based on the distances
for i in range(n_xy_to_select - 1):
# Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Get the distance between each already selected sample, and every other unselected sample
        dists_ = np.array([get_dist(selected_xy_, xy_) for xy_ in xy])
        # Apply some kind of penalty function - this is the key
        dists_[dists_ < 90] -= 25000
        # Sum dists_ onto dists
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis = 0)

The key to this is this line - the penalty function

dists_[dists_ < 90] -= 25000

This penalty function exists to prevent the code from just picking a ring of samples at the edge of the space, by artificially shortening values that are close together. However, this eventually breaks down, and the selection starts clustering, as shown in the image. You can clearly see that there are much better selections that the code can make before any kind of clustering is necessary. I feel that a kind of decaying exponential function would be best for this, but I do not know how to implement it. So my question is; how would I change the current penalty function to get what I'm looking for?


Solution

  • What I was looking for is called 'Furthest Point Sampling'. I have done some more research into the solution, and the Python code used to perform this is found here: https://minibatchai.com/ai/2021/08/07/FPS.html