Search code examples
pythonpython-itertoolsformal-languagessubsequence

Generate all unique k-subsequences


I am trying to write a Python (at least initially) function to generate all subsequences of some length k (where k > 0). Since I only need unique subsequences, I am storing both the subsequences and partial subsequences in sets. The following, adapted from a colleague, is the best I could come up with. It seems...overly complex...and like I should be able to abuse itertools, or recursion, to do what I want to do. Can anyone do better?

from typing import Set, Tuple


def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    if len(string) < k:
        return set()
    start = tuple(string[:k])
    result = {start}
    prev_state = [start]
    curr_state = set()
    for s in string[k:]:
        for p in prev_state:
            for i in range(k):
                new = p[:i] + p[i + 1 :] + (s,)
                curr_state.add(new)
        result.update(curr_state)
        prev_state = list(curr_state)
        curr_state.clear()
    return result

(For context, I am interested in induction of k-strictly piecewise languages, an efficiently learnable subclass of the regular languages, and the grammar can be characterized by all licit k-subsequences.

Ultimately I am also thinking about doing this in C++, where std::make_tuple isn't quite as powerful as Python tuple.)


Solution

  • You want a set of r combinations from n items (w/o replacement, <= (n choose r).

    Given

    import itertools as it
    
    import more_itertools as mit
    

    Code

    Option 1 - itertools.combinations

    set(it.combinations("foo", 2))
    # {('f', 'o'), ('o', 'o')}
    
    set(it.combinations("foobar", 3))
    # {('b', 'a', 'r'),
    #  ('f', 'a', 'r'),
    #  ('f', 'b', 'a'),
    #  ('f', 'b', 'r'),
    #  ('f', 'o', 'a'),
    #  ('f', 'o', 'b'),
    #  ('f', 'o', 'o'),
    #  ('f', 'o', 'r'),
    #  ('o', 'a', 'r'),
    #  ('o', 'b', 'a'),
    #  ('o', 'b', 'r'),
    #  ('o', 'o', 'a'),
    #  ('o', 'o', 'b'),
    #  ('o', 'o', 'r')}
    

    Option 2 - more_itertools.distinct_combinations

    list(mit.distinct_combinations("foo", 2))
    # [('f', 'o'), ('o', 'o')]
    
    list(mit.distinct_combinations("foobar", 3))
    # [('f', 'o', 'o'),
    #  ('f', 'o', 'b'),
    #  ('f', 'o', 'a'),
    #  ('f', 'o', 'r'),
    #  ('f', 'b', 'a'),
    #  ('f', 'b', 'r'),
    #  ('f', 'a', 'r'),
    #  ('o', 'o', 'b'),
    #  ('o', 'o', 'a'),
    #  ('o', 'o', 'r'),
    #  ('o', 'b', 'a'),
    #  ('o', 'b', 'r'),
    #  ('o', 'a', 'r'),
    #  ('b', 'a', 'r')]
    

    Both options yield the same (unordered) output. However:

    • Option 1 takes the set of all combinations (including duplicates)
    • Option 2 does not compute duplicate intermediates

    Install more_itertools via > pip install more_itertools.

    See also a rough implementation of itertools.combinations in written Python.