Search code examples
python-2.7cartesian-productradix-tree

Reversing (or simplifying) a Cartesian product?


To make things easier but also more complicated, I tried to implement a concept of "combined / concise tags" that expand further on into multiple basic tag forms.

In this case the tags consist of (one or more) "sub-tag(s)", delimited by semicolons:

food:fruit:apple:sour/sweet

drink:coffee/tea:hot/cold

wall/bike:painted:red/blue

Slashes indicate "sub-tag" interchangeability. Therefore the interpreter translates them to this:

food:fruit:apple:sour
food:fruit:apple:sweet

drink:coffee:hot
drink:coffee:cold
drink:tea:hot
drink:tea:cold

wall:painted:red
wall:painted:blue
bike:painted:red
bike:painted:blue

The code used (not perfect, but works):

import itertools

def slash_split_tag(tag):
    if not '/' in tag:
        return tag
    subtags = tag.split(':')
    pattern, v_pattern = (), ()
    for subtag in subtags:
        if '/' in subtag:
            pattern += (None,)
            v_pattern += (tuple(subtag.split('/')),)
        else:
            pattern += (subtag,)
    def merge_pattern_and_product(pattern, product):
        ret = list(pattern)
        for e in product:
            ret[ret.index(None)] = e
        return ret
    CartesianProduct = tuple(itertools.product(*v_pattern)) # http://stackoverflow.com/a/170248
    return [ ':'.join(merge_pattern_and_product(pattern, product)) for product in CartesianProduct ]

#===============================================================================
# T E S T
#===============================================================================

for tag in slash_split_tag('drink:coffee/tea:hot/cold'):
    print tag
print
for tag in slash_split_tag('A1/A2:B1/B2/B3:C1/C2:D1/D2/D3/D4/EE'):
    print tag

Question: How can I possibly revert this process? I need this for readability reasons.


Solution

  • Here's a simple, first-pass attempt at such a function:

    def compress_list(alist):
        """Compress a list of colon-separated strings into a more compact
        representation.
        """
        components = [ss.split(':') for ss in alist]
    
        # Check that every string in the supplied list has the same number of tags
        tag_counts = [len(cc) for cc in components]
        if len(set(tag_counts)) != 1:
            raise ValueError("Not all of the strings have the same number of tags")
    
        # For each component, gather a list of all the applicable tags. The set
        # at index k of tag_possibilities is all the possibilities for the
        # kth tag
        tag_possibilities = list()
        for tag_idx in range(tag_counts[0]):
            tag_possibilities.append(set(cc[tag_idx] for cc in components))
    
        # Now take the list of tags, and turn them into slash-separated strings
        tag_possibilities_strs = ['/'.join(tt) for tt in tag_possibilities]
    
        # Finally, stitch this together with colons
        return ':'.join(tag_possibilities_strs)
    

    Hopefully the comments are sufficient in explaining how it works. A few caveats, however:

    • It doesn't do anything sensible such as escaping backslashes if it finds them in the list of tags.

    • This doesn't recognise if there's a more subtle division going on, or if it gets an incomplete list of tags. Consider this example:

      fish:cheese:red
      chips:cheese:red
      fish:chalk:red
      

      It won't realise that only cheese has both fish and chips, and will instead collapse this to fish/chips:cheese/chalk:red.

    • The order of the tags in the finished string is random (or at least, I don't think it has anything to do with the order of the strings in the given list). You could sort tt before you join it with slashes if that's important.

    Testing with the three lists given in the question seems to work, although as I said, the order may be different to the initial strings:

    food:fruit:apple:sweet/sour
    drink:tea/coffee:hot/cold
    wall/bike:painted:blue/red