Search code examples
pythonoptimizationbigdatapermutation

Checking a large list of permutations. Any advice to get it to run faster?


I have about 95,000,000 permutations to check. I have 8 lists of varying length, each string identifies properties (a-k) defined in an excel sheet. e.g

bcdgj

has properties b, c, d, g and j

I need to find just one permutation that contains at least 3 of every property and then match those properties to the data in the spreadsheet

I have made this script (my first attempt at using python)

import numpy
import itertools

for x in itertools.product(['abfhj','bcdgj','fghij','abcj','bdgk','abgi','cdei','cdgi','dgik','aghi','abgh','bfhk'],['cdei','bcdgj','abcgi','abcj','abfj','bdfj','cdgi','bhjk','bdgk','dgik'],['afhk','cdgik','cegik','bdgi','cgij','cdei','bcgi','abgh'],['fhjk','bdgij','cgij','abk','ajk','bdk','cik','cdk','cei','fgj'],['abe','abcf','afh','cdi','afj','cdg','abi','cei','cgk','ceg','cgi'],['cdgi','bcgj','bcgi','bcdg','abfh','bdhi','bdgi','bdk','fhk','bei','beg','fgi','abf','abc','egi'],['bcdgik','cegik','chik','afhj','abcj','abfj'],['ceg','bcfg','cgi','bdg','afj','cgj','fhk','cfk','dgk','bcj']):
    gear = ''.join(x)
    count_a = gear.count('a')
    count_b = gear.count('b')
    count_c = gear.count('c')
    count_d = gear.count('d')
    count_e = gear.count('e')
    count_f = gear.count('f')
    count_g = gear.count('g')
    count_h = gear.count('h')
    count_i = gear.count('i')
    count_j = gear.count('j')
    count_k = gear.count('k')
    score_a = numpy.clip(count_a, 0, 3)
    score_b = numpy.clip(count_b, 0, 3)
    score_c = numpy.clip(count_c, 0, 3)
    score_d = numpy.clip(count_d, 0, 3)
    score_e = numpy.clip(count_e, 0, 3)
    score_f = numpy.clip(count_f, 0, 3)
    score_g = numpy.clip(count_g, 0, 3)
    score_h = numpy.clip(count_h, 0, 3)
    score_i = numpy.clip(count_i, 0, 3)
    score_j = numpy.clip(count_j, 0, 3)
    score_k = numpy.clip(count_k, 0, 3)
    rating = score_a + score_b + score_c + score_d + score_e + score_f + score_g + score_h + score_i + score_j + score_k
    if rating == 33:
        print(x)
        print(rating)

I've adjusted the rating requirement to test that it's working, it is but it's going to take a while to crunch through 95,000,000 permutations. Anyone have any advice for getting it to run faster? I think I've already reduced the number of values in each list as much as I can, the excel sheet the data comes from has several hundred entries per list and I've managed to reduce it to 6-12 per list.


Solution

  • from itertools import groupby, product
    
    data = (
        ['abfhj','bcdgj','fghij','abcj','bdgk','abgi','cdei','cdgi','dgik','aghi','abgh','bfhk'],
        ['cdei','bcdgj','abcgi','abcj','abfj','bdfj','cdgi','bhjk','bdgk','dgik'],
        ['afhk','cdgik','cegik','bdgi','cgij','cdei','bcgi','abgh'],
        ['fhjk','bdgij','cgij','abk','ajk','bdk','cik','cdk','cei','fgj'],
        ['abe','abcf','afh','cdi','afj','cdg','abi','cei','cgk','ceg','cgi'],
        ['cdgi','bcgj','bcgi','bcdg','abfh','bdhi','bdgi','bdk','fhk','bei','beg','fgi','abf','abc ','egi'],
        ['bcdgik','cegik','chik','afhj','abcj','abfj'],
        ['ceg','bcfg','cgi','bdg','afj','cgj','fhk','cfk','dgk','bcj'],
    )
    
    REQ_PROPS = set("abcdefghijk")
    
    for x in product(*data):
        permu = ''.join(x)
        # if the permutation does not contain all letters from a-k, skip it.
        if REQ_PROPS.difference(permu):
            continue
    
        prop_map = dict.fromkeys(permu)
        for prop, group in groupby(sorted(permu)):
            group_rating = len(tuple(group))
            # dont bother searching more props of this permutation if the current
            # property has a rating less than 3
            if group_rating < 3:
                break
            prop_map[prop] = group_rating
        # check if this permutation satisfies the requirement and exit if it does.
        if all(v is not None for v in prop_map.values()):
            print(x)
            print(prop_map)  # total rating of each property
            break
    

    Not sure if this is what you are looking for but the most obvious optimization would be to "short-circuit" the search by stopping as soon as a permutation does not have all required properties or when a permutation's property fails to have at least 3 instances. This is achieved by first sorting the permutation and then grouping it by property. After grouping you check if each property's frequency is not less than 3. If the check does not fail for all properties then you have your answer, else move on to the next permutation.

    I ran the program using the example data provided it appears that there is no solution that contains all a-k properties. Maybe you need a bigger dataset.