Coding the mathematical approach for finding index of a permutation that has repetition

How to calculate the index of an element in the list of strings sorted accordingly to the input alphabet having a given length and a given number of distinct characters.

from itertools import product
def bruteforce_item_index(item, alphabet, length, distinct):
    skipped=0
    for u in product(alphabet, repeat=length):
        v = ''.join(u)
        if v == item:
            return skipped
        if len(set(u)) == distinct:
            skipped += 1

As an example

bruteforce_item_index('0123456777', alphabet='0123456789', length=10, distinct=8)

Runs in ~1 minute and gives the answer 8245410. The run time here is proportional to the index of the given item.

I want an efficient implementation that is able to calculate that index in a fraction of second.

In other words: How to solve this problem? A mathematical approach has been provided on the same page. I want a python or java or c# code as a solution.

Solution

In this answer I will explain how to get to a function that will enable you to get the index of an element in the sequence as follows

print("Item 3749832414 is at (0-based) index %d" %  
     item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %  
     item_index('7364512193', alphabet='0123456789', length=10, distinct=8))

> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982

Enumeration method

By the nature of your problem it is interesting to solve it in a recursive manner, adding digits one by one and keeping track of the number of digits used. Python provide iterators so that you can produce items one by one without storing the whole sequence.

Basically all the items can be arranged in a prefix tree, and we walk the three yielding the leaf nodes.

def iter_seq(alphabet, length, distinct, prefix=''):
    if distinct < 0:
        # the prefix used more than the allowed number of distinct digits
        return
    if length == 0:
        # if distinct > 0 it means that prefix did not use
        # enought distinct digits
        if distinct == 0:
            yield prefix
    else:
        for d in alphabet:
            if d in prefix:
                # the number of distinct digits in prefix + d is the same 
                # as in prefix.
                yield from iter_seq(alphabet, length-1, distinct, prefix + d)
            else:
                # the number of distinct digits in prefix + d is one more 
                # than the distinct digits in prefix.
                yield from iter_seq(alphabet, length-1, distinct-1, prefix + d)

Let's test it with examples that can be visualized

list(iter_seq('0123', 5, 1))

['00000', '11111', '22222', '33333']

import numpy as np
np.reshape(list(iter_seq('0123', 4, 2)), (12, 7))

array([['0001', '0002', '0003', '0010', '0011', '0020', '0022'],
       ['0030', '0033', '0100', '0101', '0110', '0111', '0200'],
       ['0202', '0220', '0222', '0300', '0303', '0330', '0333'],
       ['1000', '1001', '1010', '1011', '1100', '1101', '1110'],
       ['1112', '1113', '1121', '1122', '1131', '1133', '1211'],
       ['1212', '1221', '1222', '1311', '1313', '1331', '1333'],
       ['2000', '2002', '2020', '2022', '2111', '2112', '2121'],
       ['2122', '2200', '2202', '2211', '2212', '2220', '2221'],
       ['2223', '2232', '2233', '2322', '2323', '2332', '2333'],
       ['3000', '3003', '3030', '3033', '3111', '3113', '3131'],
       ['3133', '3222', '3223', '3232', '3233', '3300', '3303'],
       ['3311', '3313', '3322', '3323', '3330', '3331', '3332']],
      dtype='<U4')

Counting items

As you noticed by your previous question, the number of items in a sequence only depends on the length of each string, the size of the alphabet, and the number of distinct symbols.

If we look to the loop of the above function, we only have two cases, (1) the current digit is in the prefix, (2) the digit is not in the prefix. The number of times the digit will be in the prefix is exactly the number of distinct digits in the prefix. So we can add an argument used to keep track of the number of digits already used instead of the actual prefix. Now the complexity goes from O(length!) to O(2**length). Additionally we use a lru_cache decorator that will memorize the values and return them without calling the function if the arguments are repetaed, this makes the function to run in O(length**2) time and space.

from functools import lru_cache
@lru_cache
def count_seq(n_symbols, length, distinct, used=0):
    if distinct < 0:
        return 0
    if length == 0:
        return 1 if distinct == 0 else 0
    else:
        return \
          count_seq(n_symbols, length-1, distinct-0, used+0) * used + \
          count_seq(n_symbols, length-1, distinct-1, used+1) * (n_symbols - used)

We can that it is consistent with iter_seq

assert(sum(1 for _ in iter_seq('0123', 4, 2)) == count_seq(4, 4, 2))

We can also test that it aggrees with the example you calculated by hand

assert(count_seq(10, 10, 8) == 1360800000)

Item at index

This part is not necessary to get the final answer but it is a good exercise. Furthermore it will give us a way to compute larger sequences that would be tedious by hand.

This could be achieved by iterating iter_seq the given number of times. This function does that more efficiently by comparing the number of leaves in a given subtree (number of items produced by a specific call) with the distance to the requested index. If the requested index is distant more than the number of items produced by a call it means we can skip that call at all, and jump directly to the next sibling in the tree.

def item_at(idx, alphabet, length, distinct, used=0, prefix=''):
    if distinct < 0:
        return
    if length == 0:
        return prefix
    else:
        for d in alphabet:
            if d in prefix:
                branch_count = count_seq(len(alphabet), 
                                         length-1, distinct, used)
                if branch_count <= idx:
                    idx -= branch_count
                else:
                    return item_at(idx, alphabet, 
                                   length-1, distinct, used, prefix + d)
            else:
                branch_count = count_seq(len(alphabet),
                                         length-1, distinct-1, used+1)
                if branch_count <= idx:
                    idx -= branch_count
                else:
                    return item_at(idx, alphabet,
                                   length-1, distinct-1, used+1, prefix + d)

We can test that it is consistent with iter_seq

for i, w in enumerate(iter_seq('0123', 4, 2)):
    assert w == item_at(i, '0123', 4, 2)

Index of a given item

Remembering that we are walking in a prefix tree, given a string we can walk directly to the desired node. The way to find the index is to sum the size of all the subtrees that are left behind on this path.

def item_index(item, alphabet, length, distinct, used=0, prefix=''):
    if distinct < 0:
        return 0
    if length == 0:
        return 0
    else:
        offset = 0
        for d in alphabet:
            if d in prefix:
                if d == item[0]:
                    return offset + item_index(item[1:], alphabet, 
                               length-1, distinct, used, prefix + d)
                else:
                    offset += count_seq(len(alphabet), 
                                length-1, distinct, used)
            else:
                if d == item[0]:
                    return offset + item_index(item[1:], alphabet, 
                             length-1, distinct-1, used+1, prefix + d)
                else:
                    offset += count_seq(len(alphabet), 
                                 length-1, distinct-1, used+1)

And again we can test the consistency between this and iter_seq

for i,w in enumerate(iter_seq('0123', 4, 2)):
    assert i == item_index(w, '0123', 4, 2)

Or to query for the example numbers you gave as I promised in the beginning of the post

print("Item 3749832414 is at (0-based) index %d" %  
     item_index('3749832414', alphabet='0123456789', length=10, distinct=8))
print("Item 7364512193 is at (0-based) index %d" %  
     item_index('7364512193', alphabet='0123456789', length=10, distinct=8))

> Item 3749832414 is at (0-based) index 508309342
> Item 7364512193 is at (0-based) index 1005336982

Bonus: Larger sequences

Let's calculate the index of UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0 in the sequences of length 64 and 50 distinct symbols

item_index('UCP3gzjGPMwjYbYtsFu2sDHRE14XTu8AdaWoJPOm50YZlqI6skNyfvEShdmGEiB0', 
        alphabet='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
        distinct=50, length=64)

Surprisingly it is 10000...000 = 10**110. How could I find that particular string??