Search code examples
pythonperformancecomparison

Pairwise comparison of elements of a set


I have a dataset of images, which are sorted with respect to their interestingness. I want to create a diverse subset by computing the pairwise similarities between a candidate element and all current members of the subset.

My code is below. It looks like working correctly; however, as you can guess it is very slow (takes 6-7 hours).

I wonder if there is another way to create this subset faster than this implementation.

dataset = [a,b,c,d,e,f,g,...] #an example sorted dataset. real size is 20K
subset = [] #my diverse set. Desired size is 200

for ind, element in enumerate(dataset):
  if ind == 0:
    subset.append(element[ind]) #since the dataset is sorted I can directly add the first element
  else:
    similarities = []
    for member in subset:
      similarities.append(compute_similarity(element[ind], member))

    if not any(similarities): #if the candidate is not similar to any of the existing members
      subset.append(element[ind])

  if len(subset) >= 200:
    break

Solution

  • The main improvement I see is to break as soon as you find that the current element is similar to some element already in your set. This allows for a simplification of your code, even eliminating the need for the extra check for the first element.

    dataset = [a,b,c,d,e,f,g,...] #an example sorted dataset. real size is 20K
    subset = [] #my diverse set. Desired size is 200
    
    for candidate in dataset:
      for member in subset:
        if compute_similarity(candidate, member):
          break
      else:
        subset.append(candidate)
        if len(subset) >= 200:
          break
    

    Note the for-else construct, which only executes the "else" branch if the for-loop did not break.

    Any other optimizations would depend on your actual similarity metric. Maybe you can somehow sort the data, thus reducing which comparisons you need to do.