Working with suffix trees in python

I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.

Here is my code for generating the string and the suffix tree.

import random

def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.

As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.

Many thanks

Solution

As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this

string[i+1:]

creates an actual (deep) copy of the substring starting at i+1.

If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:

def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string,i+1,N))
        else:
            tree[string[i]]=[buffer(string,i+1,N)]
    return tree

I tried this embedded in the rest of your code, and confirmed that it requires significantly less then 1 GB of main memory even at a total length of 8^11 characters.

Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.