Search code examples
pythonstatisticscombinations

Get an evenly distributed subset of combinations without repetition


I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.

If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.

ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH

I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.

Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.

EDIT - Think of it like ingredients. A-H are each unique ingredients such as flour, sugar, butter, etc. I'm creating recipes that use four of the available ingredients (not worrying about measurements, just what ingredients are used).

There are 70 possible combinations of the 8 ingredients (put in 8 objects and 4 sample size here). I want a list of a subset of those combinations, of a size I put into the formula. As an example, let's say I want 20 of the combinations. I want each ingredient to show up as much as the other ingredients, meaning among those combinations, maybe A (flour) is used 3 times, B (sugar) is used 3 times, C (butter) is used 3 times, etc.

I also want A (flour) and B (sugar) to be used together as many times as B (sugar) and C (butter) are used together in a combination.

Most subset sizes won't allow perfect distribution of items, but it should give a list that is as close to an even distribution as possible.

I have created a Python script that I think does what I want. I'm cleaning it up and testing it more, then I'll post it here.

To show an example of correct results, I found that a subset of size 14 will produce a perfect distribution:

ABCD  ADEG  BDGH
ABEH  ADFH  CDEH
ABFG  BCEG  CDFG
ACEF  BCFH  EFGH
ACGH  BDEF

You can see that in that subset, each letter is used 7 times, and each pairing (such as A and B) appear 3 times, and each triplet (such as A and B and C) appears once. This is the exact type of result I'm looking for. I hope that clarifies the problem for readers.

I'll post my python code soon.


Solution

  • Here's the Python code to create a balanced subset from a list of options. I commented enough for it to explain itself. In a nutshell, it iterates through the possible combinations, scoring each option according to how well it keeps the resulting list balanced. The combinations with the best scores get included in the final list of selections.

    import itertools
    import random
    
    allOptions = 'ABCDEFGH'
    # allOptions can be a list of anything, such as:
    # allOptions = ['Flour', 'Sugar', 'Eggs', 'Frosting', 'Butter', 'Chocolate', 'Berries', 'Milk']
    
    # Subset sizes with perfect distributions: 14, 28, 70
    subsetSize = 14
    
    # The number of letters in the combos of final results (i.e. ABCD)
    comboSize = 4
    
    # 2D list of counts of each combo, indexed by # of characters, for example:
    #  count[1] = ["A"=7, "B"=7, "C"=7...] number of times single characters appear
    #  count[2] = ["AB"=3, "AC"=3, "AD"=3...] number of times pairs appear together
    #  count[3] = ["ABC"=1, "ABD"=1, "ABE"=1...] number of times three letters appear together
    selectionCount = [dict()] * comboSize
    
    # Initialized to 1 since scores start at 0, indexes are # of characters like selectionCount
    maxScoreTarget = [1] * comboSize
    
    # Initialized to 0, will be used to keep list balanced 
    overageScore = [0] * comboSize
    
    
    # The primary function that will iterate over the list of combinations, selecting the options
    #  that create a balanced subset.
    def selectCombos():
        global overageScore
        global selectionCount
    
        # The final selection results
        selections = [] # list of objects [("a", "b", "c", "d"), ("e", "f", "g", "h")]
        
        # Initialize selectionCount to use for keeping track of singles, pairs, and trios usage
        selectionCount = makeCounts()
        
        # Create the list that will be pulled from as we identify the best combos
        comboOptions = list(itertools.combinations(allOptions, 4))
    
        # Shuffle the list of combos so it isn't always starting with ABCD
        # This helps with testing to see if the results are consistent
        random.shuffle(comboOptions)
    
        # Used to hold the current combo selection
        selection = ""
        
        # Each run through this loop will select another combo
        for subsetCount in range(1, subsetSize):
            # We have to start somewhere so the first in the list is the first to be selected
            if len(selections) == 0:
                selection = comboOptions[0]
                # Move the selection from the comboOptions list to the selections list. 
                selections.append(selection)
                comboOptions.remove(selection)
                # Update the count of each single letter, pair, and trio
                adjustCounts(selection)
    
            # A dict to hold scores of the combos, such as 'ABCD': 5. See scoring function.
            optionScores = dict()
            lowestScore = 999    # initialize to something that will be replaced.
            # Overage score 
            overageScore = [0] * comboSize
    
            # Calling a 4 letter combo a 'word'
            for word in comboOptions:
                # Make wordIndex from combo by turning ('A', 'B', 'C', 'D') into 'ABCD'.
                wordIndex = "".join(word)
                # Score the word
                optionScores[wordIndex] = scoreCandidate(word)
                # If score is less than lowestScore, replace lowest score and set selection
                if optionScores[wordIndex] < lowestScore or optionScores[wordIndex] == 0:
                    lowestScore = optionScores[wordIndex]
                    selection = word
            # We'll call the number of letters the category, i.e. 2 is 2 letter pairs.
            for category in range(1, comboSize):
                # If every combo caused a count to exceed the maxScore, it's time to increase the score
                if overageScore[category] >= len(comboOptions)-1:
                    maxScoreTarget[category] += 1
    
            if selection:
                # Move our selection from the comboOptions list to the selections list
                selections.append(selection)
                comboOptions.remove(selection)
                # Update the count of each single letter, pair, and trio
                adjustCounts(selection)
            else:
                print("THERE WAS NO SELECTION")
        
        # Create a list of the joined combos so it's easy to read ('ABCD' instead of ('A', 'B', 'C', 'D'))
        easylist = []
        for word in selections:
            easylist.append(''.join(word))
        # Sort them so it's easier to compare
        easylist.sort()
    
        # Output the resulting selections
        print('SELECTIONS')
        for word in easylist:
            print(word)
        
        # Output the count of each single letter, pair, and trio
        for catSet in selectionCount:
            print('----')
            for value in catSet:
                print(value, catSet[value])
    
    
    # Adjusts the count of each single letter, pair, and trio based on the selection we've identified
    def adjustCounts(selection):
        global selectionCount
        # Get all the singles, pairs, and trios in the selection
        selectionCombos = getWordCombos(selection)
        for category in range(1, comboSize):
            for combo in selectionCombos[category]:
                comboIndex = "".join(combo)
                # Increment the count for the combo in the selection count dict
                selectionCount[category][comboIndex] += 1
    
    
    # Scores a candidate combo, based on how it affects the count of single letters, pairs, and trios
    def scoreCandidate(word):
        global overageScore
        global selectionCount
        # word = ["a","b","c","d"]
        wordScore = 0
        # Checking each category, which is how many letters again.
        for category in range(1, comboSize):
            isOverage = False
            # Get all the [category] letter combinations of this 4 letter combo
            for item in list(itertools.combinations(word, category)):
                index = "".join(item)
                # See if the option would increase the selectionCount to exceed the maxScoreTarget
                if selectionCount[category][index] + 1 > maxScoreTarget[category]:
                    difference = selectionCount[category][index] + 1 - maxScoreTarget[category]
                    # The more the candidate exceeds the maxScoreTarget, the worse it is.
                    #  This gives preference to candidates that don't cause so much imbalance.
                    match difference:
                        case 1:
                            wordScore += 1
                        case 2:
                            wordScore += 5
                        case 3:
                            wordScore += 20
                    # Since increasing the selectionCount for this option exceeds the maxScoreTarget, set the flag
                    isOverage = True
    
            # When there's an overage, add to the overageScore for the category so we know if it's
            #  not possible to stay under the maxScoreTarget with the options we have left.
            if isOverage:
                overageScore[category] += 1
    
        return wordScore
    
    
    # Returns a list of the combos FROM THE SELECTION
    #  i.e. ABCD contains A, B, C, D single letters, AB, AC, AD, BC, BD, CD pairs, etc.
    def getWordCombos(word):
        combos = [0]*comboSize
        for category in range(1, comboSize):
            combos[category] = list(itertools.combinations(word, category))
            # Shuffling to ensure we're testing new things each time
            random.shuffle(combos[category])
        return combos
    
    
    # Creates a list of dict objects, indexed by the number of letters. Each dict is indexed
    #  by the combination of letters.
    def makeCounts():
        counters = [dict()] * comboSize
        for category in range(1, comboSize):
            counters[category] = dict()
            for combo in list(itertools.combinations(allOptions, category)):
                index = "".join(combo)
                # Initialize all counters to 0
                counters[category][index] = 0
        return counters
    
    
    if __name__ == '__main__':
        print("--------- Beginning...")
        selectCombos()
        exit(0)
        
    

    The output from running the script to get a subset size of 14, looks like this:

    --------- Beginning...
    SELECTIONS
    ABCF
    ABDH
    ABEG
    ACDG
    ACEH
    ADEF
    AFGH
    BCDE
    BCGH
    BDFG
    BEFH
    CDFH
    CEFG
    DEGH
    ----
    ----
    A 7
    B 7
    C 7
    D 7
    E 7
    F 7
    G 7
    H 7
    ----
    AB 3
    AC 3
    AD 3
    AE 3
    AF 3
    AG 3
    AH 3
    BC 3
    BD 3
    BE 3
    BF 3
    BG 3
    BH 3
    CD 3
    CE 3
    CF 3
    CG 3
    CH 3
    DE 3
    DF 3
    DG 3
    DH 3
    EF 3
    EG 3
    EH 3
    FG 3
    FH 3
    GH 3
    ----
    ABC 1
    ABD 1
    ABE 1
    ABF 1
    ABG 1
    ABH 1
    ACD 1
    ACE 1
    ACF 1
    ACG 1
    ACH 1
    ADE 1
    ADF 1
    ADG 1
    ADH 1
    AEF 1
    AEG 1
    AEH 1
    AFG 1
    AFH 1
    AGH 1
    BCD 1
    BCE 1
    BCF 1
    BCG 1
    BCH 1
    BDE 1
    BDF 1
    BDG 1
    BDH 1
    BEF 1
    BEG 1
    BEH 1
    BFG 1
    BFH 1
    BGH 1
    CDE 1
    CDF 1
    CDG 1
    CDH 1
    CEF 1
    CEG 1
    CEH 1
    CFG 1
    CFH 1
    CGH 1
    DEF 1
    DEG 1
    DEH 1
    DFG 1
    DFH 1
    DGH 1
    EFG 1
    EFH 1
    EGH 1
    FGH 1
    

    I hope this helps someone else solve this type of problem!