I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.
If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.
ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH
I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.
Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.
EDIT - Think of it like ingredients. A-H are each unique ingredients such as flour, sugar, butter, etc. I'm creating recipes that use four of the available ingredients (not worrying about measurements, just what ingredients are used).
There are 70 possible combinations of the 8 ingredients (put in 8 objects and 4 sample size here). I want a list of a subset of those combinations, of a size I put into the formula. As an example, let's say I want 20 of the combinations. I want each ingredient to show up as much as the other ingredients, meaning among those combinations, maybe A (flour) is used 3 times, B (sugar) is used 3 times, C (butter) is used 3 times, etc.
I also want A (flour) and B (sugar) to be used together as many times as B (sugar) and C (butter) are used together in a combination.
Most subset sizes won't allow perfect distribution of items, but it should give a list that is as close to an even distribution as possible.
I have created a Python script that I think does what I want. I'm cleaning it up and testing it more, then I'll post it here.
To show an example of correct results, I found that a subset of size 14 will produce a perfect distribution:
ABCD ADEG BDGH
ABEH ADFH CDEH
ABFG BCEG CDFG
ACEF BCFH EFGH
ACGH BDEF
You can see that in that subset, each letter is used 7 times, and each pairing (such as A and B) appear 3 times, and each triplet (such as A and B and C) appears once. This is the exact type of result I'm looking for. I hope that clarifies the problem for readers.
I'll post my python code soon.
Here's the Python code to create a balanced subset from a list of options. I commented enough for it to explain itself. In a nutshell, it iterates through the possible combinations, scoring each option according to how well it keeps the resulting list balanced. The combinations with the best scores get included in the final list of selections.
import itertools
import random
allOptions = 'ABCDEFGH'
# allOptions can be a list of anything, such as:
# allOptions = ['Flour', 'Sugar', 'Eggs', 'Frosting', 'Butter', 'Chocolate', 'Berries', 'Milk']
# Subset sizes with perfect distributions: 14, 28, 70
subsetSize = 14
# The number of letters in the combos of final results (i.e. ABCD)
comboSize = 4
# 2D list of counts of each combo, indexed by # of characters, for example:
# count[1] = ["A"=7, "B"=7, "C"=7...] number of times single characters appear
# count[2] = ["AB"=3, "AC"=3, "AD"=3...] number of times pairs appear together
# count[3] = ["ABC"=1, "ABD"=1, "ABE"=1...] number of times three letters appear together
selectionCount = [dict()] * comboSize
# Initialized to 1 since scores start at 0, indexes are # of characters like selectionCount
maxScoreTarget = [1] * comboSize
# Initialized to 0, will be used to keep list balanced
overageScore = [0] * comboSize
# The primary function that will iterate over the list of combinations, selecting the options
# that create a balanced subset.
def selectCombos():
global overageScore
global selectionCount
# The final selection results
selections = [] # list of objects [("a", "b", "c", "d"), ("e", "f", "g", "h")]
# Initialize selectionCount to use for keeping track of singles, pairs, and trios usage
selectionCount = makeCounts()
# Create the list that will be pulled from as we identify the best combos
comboOptions = list(itertools.combinations(allOptions, 4))
# Shuffle the list of combos so it isn't always starting with ABCD
# This helps with testing to see if the results are consistent
random.shuffle(comboOptions)
# Used to hold the current combo selection
selection = ""
# Each run through this loop will select another combo
for subsetCount in range(1, subsetSize):
# We have to start somewhere so the first in the list is the first to be selected
if len(selections) == 0:
selection = comboOptions[0]
# Move the selection from the comboOptions list to the selections list.
selections.append(selection)
comboOptions.remove(selection)
# Update the count of each single letter, pair, and trio
adjustCounts(selection)
# A dict to hold scores of the combos, such as 'ABCD': 5. See scoring function.
optionScores = dict()
lowestScore = 999 # initialize to something that will be replaced.
# Overage score
overageScore = [0] * comboSize
# Calling a 4 letter combo a 'word'
for word in comboOptions:
# Make wordIndex from combo by turning ('A', 'B', 'C', 'D') into 'ABCD'.
wordIndex = "".join(word)
# Score the word
optionScores[wordIndex] = scoreCandidate(word)
# If score is less than lowestScore, replace lowest score and set selection
if optionScores[wordIndex] < lowestScore or optionScores[wordIndex] == 0:
lowestScore = optionScores[wordIndex]
selection = word
# We'll call the number of letters the category, i.e. 2 is 2 letter pairs.
for category in range(1, comboSize):
# If every combo caused a count to exceed the maxScore, it's time to increase the score
if overageScore[category] >= len(comboOptions)-1:
maxScoreTarget[category] += 1
if selection:
# Move our selection from the comboOptions list to the selections list
selections.append(selection)
comboOptions.remove(selection)
# Update the count of each single letter, pair, and trio
adjustCounts(selection)
else:
print("THERE WAS NO SELECTION")
# Create a list of the joined combos so it's easy to read ('ABCD' instead of ('A', 'B', 'C', 'D'))
easylist = []
for word in selections:
easylist.append(''.join(word))
# Sort them so it's easier to compare
easylist.sort()
# Output the resulting selections
print('SELECTIONS')
for word in easylist:
print(word)
# Output the count of each single letter, pair, and trio
for catSet in selectionCount:
print('----')
for value in catSet:
print(value, catSet[value])
# Adjusts the count of each single letter, pair, and trio based on the selection we've identified
def adjustCounts(selection):
global selectionCount
# Get all the singles, pairs, and trios in the selection
selectionCombos = getWordCombos(selection)
for category in range(1, comboSize):
for combo in selectionCombos[category]:
comboIndex = "".join(combo)
# Increment the count for the combo in the selection count dict
selectionCount[category][comboIndex] += 1
# Scores a candidate combo, based on how it affects the count of single letters, pairs, and trios
def scoreCandidate(word):
global overageScore
global selectionCount
# word = ["a","b","c","d"]
wordScore = 0
# Checking each category, which is how many letters again.
for category in range(1, comboSize):
isOverage = False
# Get all the [category] letter combinations of this 4 letter combo
for item in list(itertools.combinations(word, category)):
index = "".join(item)
# See if the option would increase the selectionCount to exceed the maxScoreTarget
if selectionCount[category][index] + 1 > maxScoreTarget[category]:
difference = selectionCount[category][index] + 1 - maxScoreTarget[category]
# The more the candidate exceeds the maxScoreTarget, the worse it is.
# This gives preference to candidates that don't cause so much imbalance.
match difference:
case 1:
wordScore += 1
case 2:
wordScore += 5
case 3:
wordScore += 20
# Since increasing the selectionCount for this option exceeds the maxScoreTarget, set the flag
isOverage = True
# When there's an overage, add to the overageScore for the category so we know if it's
# not possible to stay under the maxScoreTarget with the options we have left.
if isOverage:
overageScore[category] += 1
return wordScore
# Returns a list of the combos FROM THE SELECTION
# i.e. ABCD contains A, B, C, D single letters, AB, AC, AD, BC, BD, CD pairs, etc.
def getWordCombos(word):
combos = [0]*comboSize
for category in range(1, comboSize):
combos[category] = list(itertools.combinations(word, category))
# Shuffling to ensure we're testing new things each time
random.shuffle(combos[category])
return combos
# Creates a list of dict objects, indexed by the number of letters. Each dict is indexed
# by the combination of letters.
def makeCounts():
counters = [dict()] * comboSize
for category in range(1, comboSize):
counters[category] = dict()
for combo in list(itertools.combinations(allOptions, category)):
index = "".join(combo)
# Initialize all counters to 0
counters[category][index] = 0
return counters
if __name__ == '__main__':
print("--------- Beginning...")
selectCombos()
exit(0)
The output from running the script to get a subset size of 14, looks like this:
--------- Beginning...
SELECTIONS
ABCF
ABDH
ABEG
ACDG
ACEH
ADEF
AFGH
BCDE
BCGH
BDFG
BEFH
CDFH
CEFG
DEGH
----
----
A 7
B 7
C 7
D 7
E 7
F 7
G 7
H 7
----
AB 3
AC 3
AD 3
AE 3
AF 3
AG 3
AH 3
BC 3
BD 3
BE 3
BF 3
BG 3
BH 3
CD 3
CE 3
CF 3
CG 3
CH 3
DE 3
DF 3
DG 3
DH 3
EF 3
EG 3
EH 3
FG 3
FH 3
GH 3
----
ABC 1
ABD 1
ABE 1
ABF 1
ABG 1
ABH 1
ACD 1
ACE 1
ACF 1
ACG 1
ACH 1
ADE 1
ADF 1
ADG 1
ADH 1
AEF 1
AEG 1
AEH 1
AFG 1
AFH 1
AGH 1
BCD 1
BCE 1
BCF 1
BCG 1
BCH 1
BDE 1
BDF 1
BDG 1
BDH 1
BEF 1
BEG 1
BEH 1
BFG 1
BFH 1
BGH 1
CDE 1
CDF 1
CDG 1
CDH 1
CEF 1
CEG 1
CEH 1
CFG 1
CFH 1
CGH 1
DEF 1
DEG 1
DEH 1
DFG 1
DFH 1
DGH 1
EFG 1
EFH 1
EGH 1
FGH 1
I hope this helps someone else solve this type of problem!