I've been struggling with this one for a bit, so I thought I would reach out!
So I have two lists of index locations that I need to generate combinations from. (Originally I had one list, and attempted to use itertools.product and itertools.combinations, but the real data creates memory errors due to size.)
So originally: (think x,y coordinates)
coords = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [1, 9], [2, 0], [2, 1], [3, 0], [3, 1], [3, 2], [4, 0], [4, 1], [4, 2], [4, 3], [4, 4], [4, 5], [5, 0], [5, 1], [5, 2], [5, 3], [5, 4], [5, 5], [5, 6], [5, 7], [6, 0], [6, 1], [6, 2], [6, 3], [6, 4], [6, 5], [6, 6], [6, 7], [6, 8], [6, 9], [6, 10], [6, 11], [6, 12], [6, 13], [6, 14], [6, 15], [6, 16], [6, 17], [6, 18], [6, 19], [6, 20], [6, 21], [6, 22], [6, 23], [6, 24], [6, 25], [6, 26], [6,
27], [6, 28], [6, 29], [7, 0], [7, 1], [7, 2], [7, 3]]
#the coords get transformed into this:
#each "x" element contains the "y" sub elements
coord_list = [[0, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1], [0, 1, 2], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [0, 1, 2, 3]]
output = list(itertools.product(*coord))
This works until I get upwards of 20 levels in my index (I've only shown 7 levels of index in the example)
So I thought that I could limit the number of combinations being generated by splitting the list into the important characteristics that interest me, and limiting how many are used at a time.
I have a variable (cutoff) that defines how many items to pull from the first list (neg_list). A new list needs to be populated with those items from the neg_list, and then with elements from the other list (pos_list).
The catch is that you can only use one item from each index level, and I need the resulting lists to reuse items from the first list only if absolutely necessary. (Maybe by adding a counter to the elements?) - The goal is to use every element at least once, but distribute the times elements at a particular index level are reused as much as possible. ....Maybe itertools.takewhile() would be handy for this?
cutoff = 2
depth = 7 #The number of unique items in the first index position
pos_list = [[0, 1], [1, 1], [1, 7], [1, 8], [2, 0], [3, 1], [4, 1], [5, 1], [6, 1], [6, 2], [7, 1]]
neg_list = [[0, 0], [1, 0], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 9], [2, 1], [3, 0], [3, 2], [4, 0], [4, 2], [4, 3], [4, 4], [4, 5], [5, 0], [5, 2], [5, 3], [5, 4], [5, 5], [5, 6], [5, 7], [6, 0], [6, 3], [6, 4], [6, 5], [6, 6], [6, 7], [6, 8], [6, 9], [6, 10], [6, 11], [6, 12], [6, 13], [6, 14], [6, 15], [6, 16], [6, 17], [6, 18], [6, 19], [6, 20], [6, 21], [6, 22], [6, 23], [6, 24], [6, 25], [6, 26], [6, 27], [6, 28], [6, 29], [7, 0], [7, 2], [7, 3]]
pseudo code:
add use_count to each element of neg_list and pos_list
get cutoff number of elements randomly from neg_list with unique first index number by choosing lowest use_count until all items have a use_count > 0
populate remaining elements up to depth number with elements from pos_list with unique first index number and lowest use_count
increment use_count on used elements in neg_list and pos_list
pseudo output:
an array or list of lists with all the combinations generated
cutoff 2 partial example: (the ^^^ indicate where the neg_list "seeds" are)
[[0, 0], [1, 1], [2, 0], [3, 2], [4, 1], [5, 1], [6, 1], [7, 1]]
^^^^ ^^^^
[[0, 1], [1, 2], [2, 0], [3, 1], [4, 1], [5, 1], [6, 18], [7, 1]]
^^^^ ^^^^^
pos_list would then maybe look like:
[[[0, 1],1], [[1, 1],1], [1, 7], [1, 8], [[2, 0],2], [[3, 1],1], [[4, 1],2] [[5, 1],2], [[6, 1],1], [[6, 2],0], [[7, 1],2]]
neg list would look similar, with counts next to the elements that have been used
The cutoff is the only variable that can change. So a cutoff of 1, would generate 54 sets I think. A cutoff of two would generate a bunch of combinations while maximizing the variability of the elements used.
Thoughts? I'm not sure where to go with this one.
Parameters:
good_coords = [(0,0), (1,0)]
bad_coords = [(0,1), (1,1), (1,2)]
cutoff = 2
x
-indices of your data must be SORTED.from collections import defaultdict
from itertools import combinations, cycle
# 'xs' stands for 'x-es', plural of 'x'
xs = sorted(list(set(pair[0] for pair in good_coords)
.union(set(pair[0] for pair in bad_coords))))
# Define 2 dictionaries. In each one:
# - keys are unique values of `x`,
# - values are lists of `y` corresponding to that `x`.
# All good pairs are in `pairs_good`, and all bad ones - in `pairs_bad`:
pairs_good = defaultdict(list, {x:[] for x in xs})
pairs_bad = defaultdict(list, {x:[] for x in xs})
for x, y in good_coords:
pairs_good[x].append(y)
for x, y in bad_coords:
pairs_bad[x].append(y)
# Define a dictionary where:
# Keys will be all integers from 0 to `cutoff`: `n_bad`.
# Values will be lists containing other lists: sequences of `y` -
# all the results corresponding to this `n_bad`:
sequences_for_each_n_bad = {}
# for `n_bad == 0`:
chosen_ys = [cycle(ys) for ys in pairs_good.values()]
sequences = []
maxlen = max(len(pairs) for pairs in pairs_good.values())
seq_iterator = zip(*chosen_ys)
while maxlen > 0:
sequences.append(next(seq_iterator))
maxlen -= 1
sequences_for_each_n_bad[0] = sequences
# for all other `n_bad`: >0:
for n_bad in range(1, cutoff + 1):
sequences = []
# `good_ys` and `bad_ys` are cyclic infinite iterators
all_good_ys = [cycle(ys) for ys in pairs_good.values()]
all_bad_ys = [cycle(ys) for ys in pairs_bad.values()]
# Go through all the possibilities of bad index locations:
for bad_ind_locs in combinations(range(len(xs)), n_bad):
maxlen = max(len(pairs_bad[xs[i]]) for i in bad_ind_locs)
# Create a list of the CHOSEN cyclic iterators of `y`s -
# each of them good or bad, depending on `bad_ind_locs`:
chosen_ys = [bad_ys if idx in bad_ind_locs else good_ys
for idx, (good_ys, bad_ys)
in enumerate(zip(all_good_ys, all_bad_ys))]
# Iterate over all elements of all rows (in parallel),
# until the last element of the longest bad row is reached:
seq_iterator = zip(*chosen_ys)
while maxlen > 0:
sequences.append(next(seq_iterator))
maxlen -= 1
sequences_for_each_n_bad[n_bad] = sequences
Result for the inputs above:
sequences_for_each_n_bad
{0: [(0, 0)], 1: [(1, 0), (0, 1), (0, 2)], 2: [(1, 1), (1, 2)]}
The keys of the dictionary sequences_for_each_n_bad
are n_bads
(then number of bad elements in a sequence). The values are all chosen sequences of y
s, for this number of bad elements.
Note that x
always has the same values in the same positions (0, 1, 2) -- so it is not necessary to store them in the pairs. I am storing them in a list xs
(x-ses) (assuming that your x
s are in the ascending order).
If you would like to convert the output into your format, you could use:
aslist = [[list(zip(xs, sequence)) for sequence in sequences]
for n_bad, sequences
in sequences_for_each_n_bad.items()]
[sublist for n_bad_list in aslist
for sublist in n_bad_list]
[[(0, 0), (1, 0)],
[(0, 1), (1, 0)],
[(0, 0), (1, 1)],
[(0, 0), (1, 2)],
[(0, 1), (1, 1)],
[(0, 1), (1, 2)]]
good_coords = [(0,0), (1,0), (2,0)]
bad_coords = [(0,1), (1,1), (2,1), (0,2), (1,2), (2,2)]
cutoff = 2
{0: [(0, 0, 0)],
1: [(1, 0, 0), (2, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 1), (0, 0, 2)],
2: [(1, 1, 0), (2, 2, 0), (1, 0, 1), (2, 0, 2), (0, 1, 1), (0, 2, 2)]}
good_coords = [(0,0), (1,0), (2,0), (3,0),(3,3)]
bad_coords = [(0,1), (1,1),(1,2), (2,1),(2,2),(2,3), (3,1),(3,2)]
cutoff = 2
{0: [(0, 0, 0, 0), (0, 0, 0, 3)],
1: [(1, 0, 0, 0),
(0, 1, 0, 3),
(0, 2, 0, 0),
(0, 0, 1, 3),
(0, 0, 2, 0),
(0, 0, 3, 3),
(0, 0, 0, 1),
(0, 0, 0, 2)],
2: [(1, 1, 0, 0),
(1, 2, 0, 3),
(1, 0, 1, 0),
(1, 0, 2, 3),
(1, 0, 3, 0),
(1, 0, 0, 1),
(1, 0, 0, 2),
(0, 1, 1, 3),
(0, 2, 2, 0),
(0, 1, 3, 3),
(0, 2, 0, 1),
(0, 1, 0, 2),
(0, 0, 1, 1),
(0, 0, 2, 2),
(0, 0, 3, 1)]}