Search code examples
pythonpython-2.7sequencepython-itertools

How to generate this custom alpha-numeric sequence?


I would like to create a program that generate a particular long 7 characters string.

It must follow this rules:

  1. 0-9 are before a-z which are before A-Z

  2. Length is 7 characters.

  3. Each character must be different from the two close (Example 'NN' is not allowed)

  4. I need all the possible combination incrementing from 0000000 to ZZZZZZZ but not in a random sequence

I have already done it with this code:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

chars = digits + ascii_lowercase + ascii_uppercase

for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)

But it is not performant at all because i have to wait a long time before the next combination.

Can someone help me?


Solution

  • Edit: I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it'd take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.

    I'll let the docstring do all of the talking for describing the solution.

    """
    Problem:
        Generate a string of N characters that only contains alphanumerical
        characters. The following restrictions apply:
            * 0-9 must come before a-z, which must come before A-Z
            * it's valid to not have any digits or letters in a sequence
            * no neighbouring characters can be the same
            * the sequences must be in an order as if the string is base62, e.g.,
              01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc
    
    Solution:
        Implement a recursive approach which discards invalid trees. For example,
        for "---" start with "0--" and recurse. Try "00-", but discard it for
        "01-". The first and last sequences would then be "010" and "ZYZ".
    
        If the previous character in the sequence is a lowercase letter, such as
        in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
        for "9gB-", we should only be working with A-Z.
    
        The input also allows to define a specific sequence to start from. For
        example, for "abGH", each character will have access to a limited set of
        its pool. In this case, the last letter can iterate from H to Z, at which
        point it'll be free to iterate its whole character pool next time around.
    
        When specifying a starting sequence, if it doesn't have enough characters
        compared to `length`, it will be padded to the right with characters free
        to explore their character pool. For example, for length 4, the starting
        sequence "29" will be transformed to "29  ", where we will deal with two
        restricted characters temporarily.
    
        For long lengths the function internally calls a routine which relies on
        fewer recursions and cached results. Length 4 has been chosen as optimal
        in terms of precomputing time and memory demands. Briefly, the sequence is
        broken into a remainder and chunks of 4. For each preceeding valid
        subsequence, all valid following subsequences are fetched. For example, a
        sequence of six would be split into "--|----" and for "fB|----" all
        subsequences of 4 starting A, C, D, etc would be produced.
    
    Examples:
        >>> for i, x in enumerate(generate_sequences(7)):
        ...    print i, x
        0, 0101010
        1, 0101012
        etc
    
        >>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
        ...    print i, x
        0, 012abcAB
        1, 012abcAC
        etc
    
        >>> for i, x in enumerate(generate_sequences(7, 'aB')):
        ...    print i, x
        0, aBABABA
        1, aBABABC
        etc
    """
    
    import string
    
    ALLOWED_CHARS = (string.digits + string.ascii_letters,
                     string.ascii_letters,
                     string.ascii_uppercase,
                     )
    CACHE_LEN = 4
    
    def _generate_sequences(length, sequence, previous=''):
        char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
        if sequence[-length] != ' ':
            char_set = char_set[char_set.find(sequence[-length]):]
            sequence[-length] = ' '
        char_set = char_set.replace(previous, '')
    
        if length == 1:
            for char in char_set:
                yield char
        else:
            for char in char_set:
                for seq in _generate_sequences(length-1, sequence, char):
                    yield char + seq
    
    def _generate_sequences_cache(length, sequence, cache, previous=''):
        sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
        subseq = cache[sublength != CACHE_LEN]
        char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
        if sequence[-length] != ' ':
            char_set = char_set[char_set.find(sequence[-length]):]
            index = len(sequence) - length
            subseq0 = ''.join(sequence[index:index+sublength]).strip()
            sequence[index:index+sublength] = [' '] * sublength
            if len(subseq0) > 1:
                subseq[char_set[0]] = tuple(
                        s for s in subseq[char_set[0]] if s.startswith(subseq0))
        char_set = char_set.replace(previous, '')
    
        if length == CACHE_LEN:
            for char in char_set:
                for seq in subseq[char]:
                    yield seq
        else:
            for char in char_set:
                for seq1 in subseq[char]:
                    for seq2 in _generate_sequences_cache(
                                    length-sublength, sequence, cache, seq1[-1]):
                        yield seq1 + seq2
    
    def precompute(length):
        char_set = ALLOWED_CHARS[0]
        if length > 1:
            sequence = [' '] * length
            result = {}
            for char in char_set:
                result[char] = tuple(char + seq for seq in  _generate_sequences(
                                                         length-1, sequence, char))
        else:
            result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
        return result
    
    def generate_sequences(length, sequence=''):
        # -------------------------------------------------------------------------
        # Error checking: consistency of the value/type of the arguments
        if not isinstance(length, int):
            msg = 'The sequence length must be an integer: {}'
            raise TypeError(msg.format(type(length)))
        if length < 0:
            msg = 'The sequence length must be greater or equal than 0: {}'
            raise ValueError(msg.format(length))
        if not isinstance(sequence, str):
            msg = 'The sequence must be a string: {}'
            raise TypeError(msg.format(type(sequence)))
        if len(sequence) > length:
            msg = 'The sequence has length greater than {}'
            raise ValueError(msg.format(length))
        # -------------------------------------------------------------------------
        if not length:
            yield ''
        else:
            # ---------------------------------------------------------------------
            # Error checking: the starting sequence, if provided, must be valid
            if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
                msg = 'The sequence contains invalid characters: {}'
                raise ValueError(msg.format(sequence))
            if sequence.strip() != sequence.replace(' ', ''):
                msg = 'Uninitiated characters in the middle of the sequence: {}'
                raise ValueError(msg.format(sequence.strip()))
            sequence = sequence.strip()
            if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
                msg = 'No neighbours must be the same character: {}'
                raise ValueError(msg.format(sequence))
            char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
            if char_type != sorted(char_type):
                msg = '0-9 must come before a-z, which must come before A-Z: {}'
                raise ValueError(msg.format(sequence))
            # ---------------------------------------------------------------------
            sequence = list(sequence.ljust(length))
            if length <= CACHE_LEN:
                for s in _generate_sequences(length, sequence):
                    yield s
            else:
                remainder = length % CACHE_LEN
                if not remainder:
                    cache = tuple((precompute(CACHE_LEN),))
                else:
                    cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
                for s in _generate_sequences_cache(length, sequence, cache):
                    yield s
    

    I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never do so with invalid input. Specifically, invalid starting sequences.

    Counting number of sequences of specific length

    While the function will sequentially generate the sequences, there is a simple combinatorics calcuation we can perform to compute how many valid sequences exist in total.

    The sequences can effectively be broken down to 3 separate subsequences. Generally speaking, a sequence can contain anything from 0 to 7 digits, followed by from 0 to 7 lowercase letters, followed by from 0 to 7 uppercase letters. As long as the sum of those is 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use the stars and bars to calculate the various combinations of splitting a sum of N into k bins. There is already an implementation for python, which we'll borrow. The first few partitions are:

    [0, 0, 7]
    [0, 1, 6]
    [0, 2, 5]
    [0, 3, 4]
    [0, 4, 3]
    [0, 5, 2]
    [0, 6, 1]
    ...
    

    Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.

    So, how many digit combinations we can have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one that the previous character is). Similarly for the third letter and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for length 3 for letters, we get 26*25*25. Overall, for the partition, say, (2, 3, 2), we have 10*9*26*25*25*26*25 = 950625000 combinations.

    import itertools as it
    
    def partitions(n, k):
        for c in it.combinations(xrange(n+k-1), k-1):
            yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]
    
    def count_subsequences(pool, length):
        if length < 2:
            return pool**length
        return pool * (pool-1)**(length-1)
    
    def count_sequences(length):
        counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
                  for i in [10, 26]]
    
        print 'Partition {:>18}'.format('Sequence count')
    
        total = 0
        for a, b, c in partitions(length, 3):
            subtotal = counts[0][a] * counts[1][b] * counts[1][c]
            total += subtotal
            print '{} {:18}'.format((a, b, c), subtotal)
        print '\nTOTAL {:22}'.format(total)
    

    Overall, we observe that while generating the sequences fast isn't a problem, there are so many that it can take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences and this number only scales approximately by a factor of 25 with each incremented length.