Search code examples
pythonloopsset

Better to add item to a set, or convert final list to a set?


I have some data that looks something like this:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...  

where each row is a group.

My goal is to have a dictionary for each ID, followed by a set of the other IDs that share >= 1 group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

  1. Check whether an ID is already in the list before adding it
  2. Create sets instead of lists, and add each ID to the set
  3. Add all IDs to the list, then convert all of the lists to sets at the end.

Solution

  • TL;DR: Go with option 2. Just use sets from the start.

    In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

    So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

    Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchnarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.