Create combinations from two lists based on variable number of items from first list

I've been struggling with this one for a bit, so I thought I would reach out!

So I have two lists of index locations that I need to generate combinations from. (Originally I had one list, and attempted to use itertools.product and itertools.combinations, but the real data creates memory errors due to size.)

So originally: (think x,y coordinates)

coords = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [1, 9], [2, 0], [2, 1], [3, 0], [3, 1], [3, 2], [4, 0], [4, 1], [4, 2], [4, 3], [4, 4], [4, 5], [5, 0], [5, 1], [5, 2], [5, 3], [5, 4], [5, 5], [5, 6], [5, 7], [6, 0], [6, 1], [6, 2], [6, 3], [6, 4], [6, 5], [6, 6], [6, 7], [6, 8], [6, 9], [6, 10], [6, 11], [6, 12], [6, 13], [6, 14], [6, 15], [6, 16], [6, 17], [6, 18], [6, 19], [6, 20], [6, 21], [6, 22], [6, 23], [6, 24], [6, 25], [6, 26], [6,
27], [6, 28], [6, 29], [7, 0], [7, 1], [7, 2], [7, 3]]

#the coords get transformed into this:
#each "x" element contains the "y" sub elements

coord_list = [[0, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1], [0, 1, 2], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [0, 1, 2, 3]]

output = list(itertools.product(*coord))

This works until I get upwards of 20 levels in my index (I've only shown 7 levels of index in the example)

So I thought that I could limit the number of combinations being generated by splitting the list into the important characteristics that interest me, and limiting how many are used at a time.

I have a variable (cutoff) that defines how many items to pull from the first list (neg_list). A new list needs to be populated with those items from the neg_list, and then with elements from the other list (pos_list).

The catch is that you can only use one item from each index level, and I need the resulting lists to reuse items from the first list only if absolutely necessary. (Maybe by adding a counter to the elements?) - The goal is to use every element at least once, but distribute the times elements at a particular index level are reused as much as possible. ....Maybe itertools.takewhile() would be handy for this?

cutoff = 2
depth = 7  #The number of unique items in the first index position

pos_list = [[0, 1], [1, 1], [1, 7], [1, 8], [2, 0], [3, 1], [4, 1], [5, 1], [6, 1], [6, 2], [7, 1]]
neg_list = [[0, 0], [1, 0], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 9], [2, 1], [3, 0], [3, 2], [4, 0], [4, 2], [4, 3], [4, 4], [4, 5], [5, 0], [5, 2], [5, 3], [5, 4], [5, 5], [5, 6], [5, 7], [6, 0], [6, 3], [6, 4], [6, 5], [6, 6], [6, 7], [6, 8], [6, 9], [6, 10], [6, 11], [6, 12], [6, 13], [6, 14], [6, 15], [6, 16], [6, 17], [6, 18], [6, 19], [6, 20], [6, 21], [6, 22], [6, 23], [6, 24], [6, 25], [6, 26], [6, 27], [6, 28], [6, 29], [7, 0], [7, 2], [7, 3]]

pseudo code:
add use_count to each element of neg_list and pos_list
get cutoff number of elements randomly from neg_list with unique first index number by choosing lowest use_count until all items have a use_count > 0
populate remaining elements up to depth number with elements from pos_list with unique first index number and lowest use_count
increment use_count on used elements in neg_list and pos_list

pseudo output:
an array or list of lists with all the combinations generated
cutoff 2 partial example: (the ^^^ indicate where the neg_list "seeds" are)

[[0, 0], [1, 1], [2, 0], [3, 2], [4, 1], [5, 1], [6, 1], [7, 1]]
  ^^^^                    ^^^^
[[0, 1], [1, 2], [2, 0], [3, 1], [4, 1], [5, 1], [6, 18], [7, 1]]
          ^^^^                                    ^^^^^


pos_list would then maybe look like:
[[[0, 1],1], [[1, 1],1], [1, 7], [1, 8], [[2, 0],2], [[3, 1],1], [[4, 1],2] [[5, 1],2], [[6, 1],1], [[6, 2],0], [[7, 1],2]]

neg list would look similar, with counts next to the elements that have been used

The cutoff is the only variable that can change. So a cutoff of 1, would generate 54 sets I think. A cutoff of two would generate a bunch of combinations while maximizing the variability of the elements used.

Thoughts? I'm not sure where to go with this one.

Solution

Parameters:

good_coords = [(0,0), (1,0)] 
bad_coords = [(0,1), (1,1), (1,2)] 
cutoff = 2

The `x`-indices of your data must be SORTED.

from collections import defaultdict
from itertools import combinations, cycle

# 'xs' stands for 'x-es', plural of 'x'
xs = sorted(list(set(pair[0] for pair in good_coords)
                 .union(set(pair[0] for pair in bad_coords))))

# Define 2 dictionaries. In each one:
# - keys are unique values of `x`, 
# - values are lists of `y` corresponding to that `x`. 
# All good pairs are in `pairs_good`, and all bad ones - in `pairs_bad`:
pairs_good = defaultdict(list, {x:[] for x in xs})
pairs_bad = defaultdict(list, {x:[] for x in xs})
for x, y in good_coords:
    pairs_good[x].append(y)
for x, y in bad_coords:  
    pairs_bad[x].append(y)

# Define a dictionary where:
# Keys will be all integers from 0 to `cutoff`: `n_bad`.
# Values will be lists containing other lists: sequences of `y` -
# all the results corresponding to this `n_bad`:
sequences_for_each_n_bad = {}

# for `n_bad == 0`:
chosen_ys = [cycle(ys) for ys in pairs_good.values()]
sequences = []
maxlen = max(len(pairs) for pairs in pairs_good.values())
seq_iterator = zip(*chosen_ys)
while maxlen > 0:
    sequences.append(next(seq_iterator))
    maxlen -= 1
sequences_for_each_n_bad[0] = sequences

# for all other `n_bad`: >0:
for n_bad in range(1, cutoff + 1):
    sequences = []
    # `good_ys` and `bad_ys` are cyclic infinite iterators
    all_good_ys = [cycle(ys) for ys in pairs_good.values()]
    all_bad_ys = [cycle(ys) for ys in pairs_bad.values()] 
    
    # Go through all the possibilities of bad index locations:
    for bad_ind_locs in combinations(range(len(xs)), n_bad):
        maxlen = max(len(pairs_bad[xs[i]]) for i in bad_ind_locs)
        # Create a list of the CHOSEN cyclic iterators of `y`s -
        # each of them good or bad, depending on `bad_ind_locs`:
        chosen_ys = [bad_ys if idx in bad_ind_locs else good_ys
                     for idx, (good_ys, bad_ys) 
                     in enumerate(zip(all_good_ys, all_bad_ys))]
        # Iterate over all elements of all rows (in parallel),
        # until the last element of the longest bad row is reached:
        seq_iterator = zip(*chosen_ys)
        while maxlen > 0:
            sequences.append(next(seq_iterator))
            maxlen -= 1
    
    sequences_for_each_n_bad[n_bad] = sequences

Result for the inputs above:

sequences_for_each_n_bad

{0: [(0, 0)], 1: [(1, 0), (0, 1), (0, 2)], 2: [(1, 1), (1, 2)]}

The keys of the dictionary sequences_for_each_n_bad are n_bads (then number of bad elements in a sequence). The values are all chosen sequences of ys, for this number of bad elements.

Note that x always has the same values in the same positions (0, 1, 2) -- so it is not necessary to store them in the pairs. I am storing them in a list xs (x-ses) (assuming that your xs are in the ascending order).

If you would like to convert the output into your format, you could use:

aslist = [[list(zip(xs, sequence)) for sequence in sequences]
           for n_bad, sequences 
           in sequences_for_each_n_bad.items()]
[sublist for n_bad_list in aslist
         for sublist in n_bad_list]

[[(0, 0), (1, 0)],
 [(0, 1), (1, 0)],
 [(0, 0), (1, 1)],
 [(0, 0), (1, 2)],
 [(0, 1), (1, 1)],
 [(0, 1), (1, 2)]]

Testing other examples:

good_coords = [(0,0), (1,0), (2,0)] 
bad_coords = [(0,1), (1,1), (2,1), (0,2), (1,2), (2,2)] 
cutoff = 2

{0: [(0, 0, 0)],
 1: [(1, 0, 0), (2, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 1), (0, 0, 2)],
 2: [(1, 1, 0), (2, 2, 0), (1, 0, 1), (2, 0, 2), (0, 1, 1), (0, 2, 2)]}

good_coords = [(0,0), (1,0), (2,0), (3,0),(3,3)] 
bad_coords = [(0,1), (1,1),(1,2), (2,1),(2,2),(2,3), (3,1),(3,2)]
cutoff = 2

{0: [(0, 0, 0, 0), (0, 0, 0, 3)],
 1: [(1, 0, 0, 0),
  (0, 1, 0, 3),
  (0, 2, 0, 0),
  (0, 0, 1, 3),
  (0, 0, 2, 0),
  (0, 0, 3, 3),
  (0, 0, 0, 1),
  (0, 0, 0, 2)],
 2: [(1, 1, 0, 0),
  (1, 2, 0, 3),
  (1, 0, 1, 0),
  (1, 0, 2, 3),
  (1, 0, 3, 0),
  (1, 0, 0, 1),
  (1, 0, 0, 2),
  (0, 1, 1, 3),
  (0, 2, 2, 0),
  (0, 1, 3, 3),
  (0, 2, 0, 1),
  (0, 1, 0, 2),
  (0, 0, 1, 1),
  (0, 0, 2, 2),
  (0, 0, 3, 1)]}

Create combinations from two lists based on variable number of items from first list

The x-indices of your data must be SORTED.

Testing other examples:

The `x`-indices of your data must be SORTED.