Search code examples
pythonpython-itertools

Python - get all in order splits of a string


That is, for a sentence break it up into all possible combinations of in-order words, with no words omitted

For example, for input "The cat sat on the mat"

output

[("The", "cat sat on the mat"),
("The cat", "sat on the mat"),  
("The cat", "sat", "on the mat")] #etc

but not

("The mat", "cat sat on the") # out of order
("The cat"), ("mat") # words missing

I looked at methods in itertools, but can't see that they do the job, as combinations will miss items out ("the cat", "mat") and permutations will change the order.

Am I missing something in these tools, or are they just not right ones?

(For the sake of clarity, this is not a question about how to split a string, but how to get combinations)


Solution

  • Modifying Raymond Hettinger's partition recipe for Python 3 as inspired by this blog post from WordAligned, and every partition case with your list, we can accomplish this with chain and combinations from itertools.

    from itertools import chain, combinations
    def partition(iterable):
        n = len(input_list)
        b, mid, e = [0], list(range(1, n)), [n]
        getslice = input_list.__getitem__
        splits = (d for i in range(n) for d in combinations(mid, i))
        return [[input_list[sl] for sl in map(slice, chain(b, d), chain(d, e))]
                for d in splits]
    

    Demo:

    >>> print(partition(input_list))
    [[['The', 'cat', 'sat', 'on', 'the', 'mat']], [['The'], ['cat', 'sat', 'on', 'the', 'mat']], [['The', 'cat'], ['sat', 'on', 'the', 'mat']], [['The', 'cat', 'sat'], ['on', 'the', 'mat']], [['The', 'cat', 'sat', 'on'], ['the', 'mat']], [['The', 'cat', 'sat', 'on', 'the'], ['mat']], [['The'], ['cat'], ['sat', 'on', 'the', 'mat']], [['The'], ['cat', 'sat'], ['on', 'the', 'mat']], [['The'], ['cat', 'sat', 'on'], ['the', 'mat']], [['The'], ['cat', 'sat', 'on', 'the'], ['mat']], [['The', 'cat'], ['sat'], ['on', 'the', 'mat']], [['The', 'cat'], ['sat', 'on'], ['the', 'mat']], [['The', 'cat'], ['sat', 'on', 'the'], ['mat']], [['The', 'cat', 'sat'], ['on'], ['the', 'mat']], [['The', 'cat', 'sat'], ['on', 'the'], ['mat']], [['The', 'cat', 'sat', 'on'], ['the'], ['mat']], [['The'], ['cat'], ['sat'], ['on', 'the', 'mat']], [['The'], ['cat'], ['sat', 'on'], ['the', 'mat']], [['The'], ['cat'], ['sat', 'on', 'the'], ['mat']], [['The'], ['cat', 'sat'], ['on'], ['the', 'mat']], [['The'], ['cat', 'sat'], ['on', 'the'], ['mat']], [['The'], ['cat', 'sat', 'on'], ['the'], ['mat']], [['The', 'cat'], ['sat'], ['on'], ['the', 'mat']], [['The', 'cat'], ['sat'], ['on', 'the'], ['mat']], [['The', 'cat'], ['sat', 'on'], ['the'], ['mat']], [['The', 'cat', 'sat'], ['on'], ['the'], ['mat']], [['The'], ['cat'], ['sat'], ['on'], ['the', 'mat']], [['The'], ['cat'], ['sat'], ['on', 'the'], ['mat']], [['The'], ['cat'], ['sat', 'on'], ['the'], ['mat']], [['The'], ['cat', 'sat'], ['on'], ['the'], ['mat']], [['The', 'cat'], ['sat'], ['on'], ['the'], ['mat']], [['The'], ['cat'], ['sat'], ['on'], ['the'], ['mat']]]