Search code examples
pythonalgorithmtreetrie

Returning the most similar bit signature from a prefix tree in python


I never coded with python before (I'm a java programmer) and I'm looking at code that says it returns the most similar bit signature/vector in a prefix tree. The signature could be for example like this "1001". Can someone please explain to me how the code works? How does it iterate over the prefix tree in order to find the most similar/closest signature to a query signature in the tree? Similarity is based on hamming distance.

Here's the code:

class SignatureTrie:
    @staticmethod
    def getNearestSignatureKey(trie, signature):
        digitReplacement = {'0': '1', '1': '0'}
        targetKey, iteratingKey = signature.to01(), ''
        for i in range(len(targetKey)):
            iteratingKey+=targetKey[i]
            if not trie.has_prefix(iteratingKey): iteratingKey=iteratingKey[:-1]+digitReplacement[targetKey[i]]
        return iteratingKey

Here's the source file: https://github.com/kykamath/streaming_lsh/blob/master/streaming_lsh/classes.py

Edit:

I will give an example of what "I'm" expecting the code to do. I don't know if the code actually does that or how it's doing so. That's why I'm asking for an interpretation of the code, especially traversing the prefix tree.

Suppose I have the following prefix tree that contain three strings/signatures: s1 = 1110 s2 = 1100 s3 = 1001

enter image description here

Suppose I have the input signature s = 1000. Now I want to get to know which vector in the prefix/trie is the most similar to the input vector s. Since s3 have the least hamming distance (1), I'm expecting the code to return the vector s3.

What I need is someone to explain to me if the code is doing what I'm expecting it to do and if so, how it's getting the most similar signature i.e. how it's traversing the tree.

If the code is not doing what I'm expecting, can someone please explain what it does giving the example I provided?


Solution

  • The code snippet you posted does not do any of the Trie search. At all.

    The function you are looking at permutes the given signature key (a string of zeros and ones) to find the nearest match; if no matches starting with the first character of the signature are found, it looks for items with the inverse value instead.

    For your sample data, if you were to look for the signature 1101, there are no exact matches. But the trie prefix search will return results for a search for 1, for 11, and for 110. The search for 1101 fails, so digitReplacement is used to replace the last 1 with a 0, which does match, so 1100 is the result of getNearestSignatureKey() function..

    To find matches, prefix matching is delegated to the trie object. This datatype is taken from the Biopython project, and is coded entirely in C; if you are so inclined, study the Trie_has_prefix function to see how that type searches for matching prefixes.

    Documentation on that data type is sparse; the best we have is this autogenerated module page:

    This module implements a trie data structure. This allows an O(M) lookup of a string in a dictionary, where M is the length of the string. It also supports approximate matches.