Search code examples
pythondata-structurespermutationlookup

Efficient Look-up for permutations of the start of lists (Python)


Problem: I need to retrieve integer lists from a container while having only a permutation of the lists' start.
Thus, I'm looking for a data-structure/ algorithm that gets an arbitrary length input and returns the corresponding list or an empty one if no matches are found. This needs to be very quick as its called hundreds of times a second on a very large container.

Example case:

{
[1, 15, 30, 20, 45],
[10, 1, 301, 45],
[15, 2, 255, 586, 950, 10124]
}

Input 1:. The Data-structure/algorithm input is [30, 1]
My expected Return: []
→ Empty List, since no matching len(Input)=2 permutation of any contained list is found.
All possible length 2 permutations of the contained list starts are [1, 15], [15, 1], [10, 1], [1, 10], [15, 2], [2, 15] thus not [30, 1].

Input 2: The Data-structure/algorithm input is [30, 1, 15]
Expected Return: [1, 15, 30, 20, 45]
→ [30, 1, 15] can be permuted into [1, 15, 30] which corresponds to the start of the returned list

Supplementary Problem information:
→ Indexes in lists [...] are unique integers (list don't contain the same number twice)
→ Lists have arbitrary but different lengths > 1, up to a few hundred integers.
→ The input length is variable and not guaranteed to exist in the container
→ The container is expected to contain up to several millions of lists
→ The Integers in the lists are actually unbounded, but are expected to be < 1 million

Naive, working but way too slow solution:

Input = sort( Input )
for element in container:
  if ( sort( element[:len(Input)] ) ) == Input ): return ( element )
# cut first Input-length elements and compare sorted
return ( [] )

→ This is infeasible, since the container contains millions of elements and the search is supposed to be called a few hundred times per seconds.


Solution

  • Let the "set hash" of a list be a hash of the list elements that doesn't depend on the order. For example set_hash(L) = hash(sort(L)) works, as does set_hash(L) = sum([hash(x) for x in L]).

    Now, for every list in your data, replace each element with the set hash of the prefix ending at that element. For example [10, 1, 301, 45] would become [set_hash([10]), set_hash([10, 1]), set_hash([10, 1, 301]), set_hash([10, 1, 301, 45])].

    Then, for each distinct set_hash make a list of the indexes at which it occurs. You would map set_hash([10,1]) -> [1], for example, because that hash only occurs in the 2nd data item (and we're counting from 0).

    Now, to query for data items matching L, you just look up the data indexes for set_hash(L), and check each of the corresponding data items to see if it really does match.