Search code examples
pythonalgorithmencodingcompressionstore

Is there a recommended algorithm for compressing multiple specific substrings in a string that resemble DNA?


I am currently in search of a solution that would help me minimize the storage space occupied by specific sets of strings. These individual strings are essentially parts of larger strings. For example, consider the following strings :

b bc b bcd b b bb abc 

These strings are substrings of the larger string below:

bcde
bbde
abce

I am seeking a solution to encode these specific strings in a manner that would consume minimal memory resources.


Solution

  • The data structure you are looking for is trie. Based on the strings you provided:

    b bc b bcd b b bb abc

    The outputs should be:

    bb
    abc
    bcd
    

    A very naive implementation of the tree data structure looks like this:

    class Tree():
        def __init__(self):
            self.firstletter = {}
        def insert(self, word):
            current = self.firstletter
            for l in word:
                current.setdefault(l, {})
                current = current[l]
                
    newtree = Tree()
    instr = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
    _ = [newtree.insert(word) for word in instr]
    

    And you can get all the 'words' out with a depth search:

    def get_words(trie, strname):
        if not trie.keys():
            print(strname)
            return
        for n in trie.keys():
            get_words(trie[n], strname + n)
    _ = [get_words(val, n) for n, val in newtrie.firstletter.items()]
    

    which gives you the outputs I listed above.

    A nicely implemented trie will compress the data further and make searches faster. There are many nicely implemented tries in different languages. Depending on the task you may also be interested in prefix/suffix arrays and FM-Indexes.