Search code examples
pythonmemory-managementsetpython-itertools

Python: Combining itertools and sets to save memory


so I discovered Sets in Python a few days ago and am surprised that they never crossed my mind before even though they make a lot of things really simple. I give an example later.

Some things are still unclear to me. The docs say that Sets can be created from iterables and that the operators always return new Sets but do they always copy all data from one set to another and from the iterable? I work with a lot of data and would love to have Sets and set operators that behave much like itertools. So Sets([iterable]) would be more like a wrapper and the operators union, intersection and so on would return "iSets" and would not copy any data. They all would evaluate once I iter my final Set. In the end I really much would like to have "iSet" operators.

Purpose: I work with MongoDB using mongoengine. I have articles saved. Some are associated with a user, some are marked as read others were shown to the user and so on. Wrapping them in Sets that do not load all data would be a great way to combine, intersect etc. them. Obviously I could make special queries but not always since MongoDB does not support joins. So I end up doing joins in Python. I know I could use a relational database then, however, I don't need joins that often and the advantages of MongoDB outweigh them in my case.

So what do you think? Is there already a third party module? Would a few lines combining itertools and Sets do?

EDIT: I accepted the answer by Martijn Pieters because it is obviously right. I ended up loading only IDs into sets to work with them. Also, the sets in Python have a pretty good running time.


Solution

  • Sets are just like dict and list; on creation they copy the references from the seeding iterable.

    Iterators cannot be sets, because you cannot enforce the uniqueness requirement of a set. You cannot know if a future value yielded by an iterator has already been seen before.

    Moreover, in order for you to determine what the intersection is between two iterables, you have to load all data from at least one of these iterables to see if there are any matches. For each item in the second iterable, you need to test if that item has been seen in the first iterable. To do so efficiently, you need to have loaded all the items from the first iterable into a set. The alternative would be to loop through the first iterable from start to finish for each item from the second iterable, leading to exponential performance degradation.