Search code examples
pythonnumpypython-itertools

Get all combinations of a word with repeat - python itertools.product too slow


I have an array value that I need to get all possible combinations. Which using itertools.product does easily.

Eg. apple could be elppa, appel, lppae etc.

However the caveat is two fold.

I need to get all combinations of this word with the letters repeated 30 times. Eg. aaaaaaaaaaaaaaaaaaaaaaaaaapple , appleaaaaaaaaaaaaaaaaaaaaapple

Obviously we're working with a gigantic data source here, so when I run tests with eg 6-10 repeats it's reasonable fast (ie. under a minute). When running a test overnight to the power of 30, it suggests the test will take days to complete.

I have played with Numpy, which has commonly been suggested on StackOverflow as a faster/lighter method to use. But I have come unstuck on this, as all the variations I have found, have resulted in scripts killing my machine and using disk space, compared to the slow (too slow for this test), but more efficient itertools.product.

Also I am not understanding how you can pull all this data into a numty array to then calculate the following, without the overhead on the system.

Ultimately.

The point of the exercise is to count how many times the word apple appears in each row of results. But only when it appears once in a row. This would count: aaaaaaaaaaaaaaaaaaaaaaaaaapple This would not: appleaaaaaaaaaaaaaaaaaaaaapple

The below code works without too much strain on the machine, but runs too slowly.

Thanks

import itertools
import time
import numpy as np

apple = ['a','p','l','e']
occurences = 0
line = 0
arr_len = len(apple)
length = 30
squared = arr_len**length

start_time = time.time()

for string in itertools.imap(''.join, itertools.product(apple, repeat=length)):
    line += 1
    if (string.count('apple')==1):
        occurences += 1
        if occurences % 100000 == 0:
            print occurences, ("--- %s seconds ---" % (time.time() - start_time)),squared, line

print ('Occurences : ',occurences)
print ('Last line no. ',line)  
print ("--- %s seconds ---" % (time.time() - start_time))

Solution

  • The way you're trying to solve the problem is inherently exponential. You need to use dynamic programming. This problem has a polynomial time solution. If your word has n characters in it, you can use a Markov chain with 2n states.

    import numpy as np
    
    word = 'papal'
    length = 10
    
    word_chars = list(set(word))
    n = len(word)
    m = len(word_chars)
    states = [0] * (2*n)
    states[0] = 1
    jumps = np.zeros((n, m), dtype=np.int)
    for i in range(n):
        for j in range(m):
            # We've seen the first i characters of word, and we see character word_chars[j]
            if word[i] == word_chars[j]:
                value = i+1
            else:
                for k in range(i+1):
                    if word[k: i] + word_chars[j] == word[:i - k + 1]:
                        value = i - k + 1
                        break
                else:
                    value = 0
            jumps[i, j] = value
    
    for i in range(length):
       new_states = [0] * (2*n)
        for j in range(n):
            for jump in jumps[j]:
                new_states[jump] += states[j]
                if n+jump < 2*n:
                    new_states[n+jump] += states[n+j]
        states = new_states
    
    print(np.sum(states[n:]))
    

    If the word is "papa" then does "papapa" match or not? If not, you should delete states in the Markov chain.