Search code examples
pythonunionpermutationpython-itertools

Fastest way to find if a Group of potentials exists in a list of all solutions in Python


I am attempting this for something other than the example, but I thought this would be the easiest explanation.

Consider you have a chain of potential items and want to find if that exists in a known list of solutions.

    solutions = ['GAT','CAT','GCT']

And you have the potentials of:

    potentials = [['G','T'],['T','A'],['T']]

Such that you would be able to determine that GAT is the only solution that can be created from the potential.

AND

You want to find all potential solutions such that

    potentials = [['G','C','A','T'],['G','C','A','T'],['G','C','A','T']]

Would return ['GAT','CAT','GCT'] since all solutions could be made from the potential.

My current solution simply builds all the combinations using a JOIN and then does an intersection.

Because of the next step, an ideal solution would actually return a the solutions such that viable options from each item was output such that:

    solutions = ['GAT','CAT','GCT']
    potentials = [['G','C','A','T'],['G','C','A'],['A','T']]
    imaginaryfunction(potentials, solutions)

Would return:

     [['G','C'],['C','A'],['T']]

If it matters, in my real world case there are 30-ish "letters", and any given item in the potential list could be up to 8 of those "letters". The solutions list is about 2000 three letter combinations. In an ideal world there is only 1 combination that works, but more typically there are 4 or 5 and it is possible but highly improbable to have have all 2000 be possible.

I have experimented with some pretty wild things ranging from joins, to making a base 30 number out of each combination so that I could just check if the number was in the list to breaking all of the solutions down into a dictionary of dictionaries and seeing if the items were there.

My code loops through this 100s of thousands if not millions of times so even small gains add up fast for me.

Edit/Additional Info:

This is used in several places. I am working with a group that does drug interaction simulations, and for them they are looking at chains which could be attachment/bonding points.

It is also used with a company that does those "Personal Stylist will pick your wardrobe" monthly boxes. (In that case it might be XS,S,M,L,LT,XL,XXL for the first match set, and Red,Green,Blue,Black,White for a second with Sleeve Length for the third) (they don't actually have someone pick what goes in the boxes)

Same block of code is also in my code for NGram Analysis.

Same block of code is in Dating app as Race(s), Gender(s), Age(s), Income(s)


Solution

  • Since your set of solutions is quite small (2000) compared to the theoretical number of possible combinations (order of 30^8) I think something like this could work:

    solutions = get_possible_solutions()
    for i, pset in enumerate(potentials):
        matching_solutions = []
        for sol in solutions:
            if sol[i] in pset:
                matching_solutions.append(sol)
        solutions = matching_solutions # Remove non-matches
    

    This solution iteratively filters the possible solutions using the potentials for the current letter in the item "string". The inner loop will get smaller every iteration, so this has the potential to be quite fast. It should at least be a lot faster than doing an explicit intersection between all possible solutions and the input solutions.