I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.
Here is my code for generating the string and the suffix tree.
import random
def get_string(length):
string=""
for i in range(length):
string += random.choice("ATGC")
return string
word=get_string(4**4)+"$"
def suffixtree(string):
for i in xrange(len(string)):
if tree.has_key(string[i]):
tree[string[i]].append([string[i+1:]][0])
else:
tree[string[i]]=[string[i+1:]]
return tree
tree={}
suffixtree(word)
When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.
As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.
Many thanks
As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this
string[i+1:]
creates an actual (deep) copy of the substring starting at i+1
.
If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:
def suffixtree(string):
N = len(string)
for i in xrange(N):
if tree.has_key(string[i]):
tree[string[i]].append(buffer(string,i+1,N))
else:
tree[string[i]]=[buffer(string,i+1,N)]
return tree
I tried this embedded in the rest of your code, and confirmed that it requires significantly less then 1 GB of main memory even at a total length of 8^11 characters.
Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer
type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.