Search code examples
permutationrank

Coding the mathematical approach for finding index of a permutation that has repetition


How to calculate the index of an element in the list of strings sorted accordingly to the input alphabet having a given length and a given number of distinct characters.

from itertools import product
def bruteforce_item_index(item, alphabet, length, distinct):
    skipped=0
    for u in product(alphabet, repeat=length):
        v = ''.join(u)
        if v == item:
            return skipped
        if len(set(u)) == distinct:
            skipped += 1

As an example

bruteforce_item_index('0123456777', alphabet='0123456789', length=10, distinct=8)

Runs in ~1 minute and gives the answer 8245410. The run time here is proportional to the index of the given item.

I want an efficient implementation that is able to calculate that index in a fraction of second.

In other words: How to solve this problem? A mathematical approach has been provided on the same page. I want a python or java or c# code as a solution.


Solution

  • In this answer I will explain how to get to a function that will enable you to get the index of an element in the sequence as follows

    print("Item 3749832414 is at (0-based) index %d" %  
         item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
    print("Item 7364512193 is at (0-based) index %d" %  
         item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
    
    > Item 3749832414 is at (0-based) index 508309342
    > Item 7364512193 is at (0-based) index 1005336982
    

    Enumeration method

    By the nature of your problem it is interesting to solve it in a recursive manner, adding digits one by one and keeping track of the number of digits used. Python provide iterators so that you can produce items one by one without storing the whole sequence.

    Basically all the items can be arranged in a prefix tree, and we walk the three yielding the leaf nodes.

    def iter_seq(alphabet, length, distinct, prefix=''):
        if distinct < 0:
            # the prefix used more than the allowed number of distinct digits
            return
        if length == 0:
            # if distinct > 0 it means that prefix did not use
            # enought distinct digits
            if distinct == 0:
                yield prefix
        else:
            for d in alphabet:
                if d in prefix:
                    # the number of distinct digits in prefix + d is the same 
                    # as in prefix.
                    yield from iter_seq(alphabet, length-1, distinct, prefix + d)
                else:
                    # the number of distinct digits in prefix + d is one more 
                    # than the distinct digits in prefix.
                    yield from iter_seq(alphabet, length-1, distinct-1, prefix + d)
    

    Let's test it with examples that can be visualized

    list(iter_seq('0123', 5, 1))
    
    ['00000', '11111', '22222', '33333']
    
    import numpy as np
    np.reshape(list(iter_seq('0123', 4, 2)), (12, 7))
    
    array([['0001', '0002', '0003', '0010', '0011', '0020', '0022'],
           ['0030', '0033', '0100', '0101', '0110', '0111', '0200'],
           ['0202', '0220', '0222', '0300', '0303', '0330', '0333'],
           ['1000', '1001', '1010', '1011', '1100', '1101', '1110'],
           ['1112', '1113', '1121', '1122', '1131', '1133', '1211'],
           ['1212', '1221', '1222', '1311', '1313', '1331', '1333'],
           ['2000', '2002', '2020', '2022', '2111', '2112', '2121'],
           ['2122', '2200', '2202', '2211', '2212', '2220', '2221'],
           ['2223', '2232', '2233', '2322', '2323', '2332', '2333'],
           ['3000', '3003', '3030', '3033', '3111', '3113', '3131'],
           ['3133', '3222', '3223', '3232', '3233', '3300', '3303'],
           ['3311', '3313', '3322', '3323', '3330', '3331', '3332']],
          dtype='<U4')
    

    Counting items

    As you noticed by your previous question, the number of items in a sequence only depends on the length of each string, the size of the alphabet, and the number of distinct symbols.

    If we look to the loop of the above function, we only have two cases, (1) the current digit is in the prefix, (2) the digit is not in the prefix. The number of times the digit will be in the prefix is exactly the number of distinct digits in the prefix. So we can add an argument used to keep track of the number of digits already used instead of the actual prefix. Now the complexity goes from O(length!) to O(2**length). Additionally we use a lru_cache decorator that will memorize the values and return them without calling the function if the arguments are repetaed, this makes the function to run in O(length**2) time and space.

    from functools import lru_cache
    @lru_cache
    def count_seq(n_symbols, length, distinct, used=0):
        if distinct < 0:
            return 0
        if length == 0:
            return 1 if distinct == 0 else 0
        else:
            return \
              count_seq(n_symbols, length-1, distinct-0, used+0) * used + \
              count_seq(n_symbols, length-1, distinct-1, used+1) * (n_symbols - used)
    

    We can that it is consistent with iter_seq

    assert(sum(1 for _ in iter_seq('0123', 4, 2)) == count_seq(4, 4, 2))
    

    We can also test that it aggrees with the example you calculated by hand

    assert(count_seq(10, 10, 8) == 1360800000)
    

    Item at index

    This part is not necessary to get the final answer but it is a good exercise. Furthermore it will give us a way to compute larger sequences that would be tedious by hand.

    This could be achieved by iterating iter_seq the given number of times. This function does that more efficiently by comparing the number of leaves in a given subtree (number of items produced by a specific call) with the distance to the requested index. If the requested index is distant more than the number of items produced by a call it means we can skip that call at all, and jump directly to the next sibling in the tree.

    def item_at(idx, alphabet, length, distinct, used=0, prefix=''):
        if distinct < 0:
            return
        if length == 0:
            return prefix
        else:
            for d in alphabet:
                if d in prefix:
                    branch_count = count_seq(len(alphabet), 
                                             length-1, distinct, used)
                    if branch_count <= idx:
                        idx -= branch_count
                    else:
                        return item_at(idx, alphabet, 
                                       length-1, distinct, used, prefix + d)
                else:
                    branch_count = count_seq(len(alphabet),
                                             length-1, distinct-1, used+1)
                    if branch_count <= idx:
                        idx -= branch_count
                    else:
                        return item_at(idx, alphabet,
                                       length-1, distinct-1, used+1, prefix + d)
    

    We can test that it is consistent with iter_seq

    for i, w in enumerate(iter_seq('0123', 4, 2)):
        assert w == item_at(i, '0123', 4, 2)
    

    Index of a given item

    Remembering that we are walking in a prefix tree, given a string we can walk directly to the desired node. The way to find the index is to sum the size of all the subtrees that are left behind on this path.

    def item_index(item, alphabet, length, distinct, used=0, prefix=''):
        if distinct < 0:
            return 0
        if length == 0:
            return 0
        else:
            offset = 0
            for d in alphabet:
                if d in prefix:
                    if d == item[0]:
                        return offset + item_index(item[1:], alphabet, 
                                   length-1, distinct, used, prefix + d)
                    else:
                        offset += count_seq(len(alphabet), 
                                    length-1, distinct, used)
                else:
                    if d == item[0]:
                        return offset + item_index(item[1:], alphabet, 
                                 length-1, distinct-1, used+1, prefix + d)
                    else:
                        offset += count_seq(len(alphabet), 
                                     length-1, distinct-1, used+1)
    

    And again we can test the consistency between this and iter_seq

    for i,w in enumerate(iter_seq('0123', 4, 2)):
        assert i == item_index(w, '0123', 4, 2)
    

    Or to query for the example numbers you gave as I promised in the beginning of the post

    print("Item 3749832414 is at (0-based) index %d" %  
         item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
    print("Item 7364512193 is at (0-based) index %d" %  
         item_index('7364512193', alphabet='0123456789', length=10, distinct=8))
    
    > Item 3749832414 is at (0-based) index 508309342
    > Item 7364512193 is at (0-based) index 1005336982
    

    Bonus: Larger sequences

    Let's calculate the index of UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0 in the sequences of length 64 and 50 distinct symbols

    item_index('UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0', 
            alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
            distinct=50, length=64)
    

    Surprisingly it is 10000...000 = 10**110. How could I find that particular string??