python algorithm encoding compression store

Is there a recommended algorithm for compressing multiple specific substrings in a string that resemble DNA?

I am currently in search of a solution that would help me minimize the storage space occupied by specific sets of strings. These individual strings are essentially parts of larger strings. For example, consider the following strings :

b bc b bcd b b bb abc

These strings are substrings of the larger string below:

bcde
bbde
abce

I am seeking a solution to encode these specific strings in a manner that would consume minimal memory resources.

Solution

The data structure you are looking for is trie. Based on the strings you provided:

b bc b bcd b b bb abc

The outputs should be:

bb
abc
bcd

A very naive implementation of the tree data structure looks like this:

class Tree():
    def __init__(self):
        self.firstletter = {}
    def insert(self, word):
        current = self.firstletter
        for l in word:
            current.setdefault(l, {})
            current = current[l]
            
newtree = Tree()
instr = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
_ = [newtree.insert(word) for word in instr]

And you can get all the 'words' out with a depth search:

def get_words(trie, strname):
    if not trie.keys():
        print(strname)
        return
    for n in trie.keys():
        get_words(trie[n], strname + n)
_ = [get_words(val, n) for n, val in newtrie.firstletter.items()]

which gives you the outputs I listed above.

A nicely implemented trie will compress the data further and make searches faster. There are many nicely implemented tries in different languages. Depending on the task you may also be interested in prefix/suffix arrays and FM-Indexes.