Search code examples
pythonnumpyperformanceuniquefifo

Efficient reverse order comparison of huge growing list in Python


In Python, my goal is to maintain a unique list of points (complex scalars, rounded), while steadily creating new ones with a function, like in this pseudo code

list_of_points = []

while True
   # generate new point according to some rule
   z = generate() 

   # check whether this point is already there
   if z not in list_of_points:
      list_of_points.append(z)
   
   if some_condition:
      break

Now list_of_points can become potentially huge (like 10 million entries or even more) during the process and duplicates are quite frequent. In fact about 50% of the time, a newly created point is already somewhere in the list. However, what I know is that oftentimes the already existing point is near the end of the list. Sometimes it is in the "bulk" and only very occasionally it can be found near the beginning.

This brought me to the idea of doing the search in reverse order. But how would I do this most efficiently (in terms of raw speed), given my potentially large list which grows during the process. Is the list container even the best way here?

I managed to gain some performance by doing this

list_of_points = []

while True
   # generate new point according to some rule
   z = generate() 

   # check very end of list
   if z in list_of_points[-10:]:
      continue

   # check deeper into the list
   if z in list_of_points[-100:-10]:
      continue

   # check the rest
   if z not in list_of_points[:-100]:
      list_of_points.append(z)

   if some_condition:
      break

Apparently, this is not very elegant. Using instead a second, FIFO-type container (collection.deque), gives about the same speed up.


Solution

  • Your best bet might to be to use a set instead of a list, python sets use hashing to insert items, so it is very fast. And, you can skip the step of checking if an item is already in the list by simply trying to add it, if it is already in the set it wont be added since duplicates are not allowed.

    Stealing your pseudo code axample

    set_of_points = {}
    
    while True
       # get size of set
       a = len(set_of_points)
    
       # generate new point according to some rule
       z = generate() 
    
       # try to add z to the set
       set_of_points.add(z)
    
       b = len(set_of_points)
    
       # if a == b  it was not added, thus already existed in the set
    
       if some_condition:
          break