I am trying to write some code to construct a DeBruijn graph from a set of kmers (k letter long strings, DNA sequencing reads) in Python, output as a collection of edges, joining the same node to others.
When I run my code on sample input:
['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
I get:
CAG -> AGG
GAG -> AGG
Instead of:
AGG -> GGG
CAG -> AGG,AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
Any hint of what I am doing wrong?
Here is the code:
import itertools
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
y=[a[1:] for a in inp]
z=[b[:len(b)-1] for b in inp]
y.extend(z)
edjes=list(set(y))
w=[c[1:] for c in edjes]
v=[d[:len(d)-1] for d in edjes]
w.extend(v)
nodes=list(set(w))
graph={}
new=itertools.product(edjes,edjes)
for node in nodes:
for edj in new:
edje1,edje2=edj[0],edj[1]
if edje1[1:]==node and edje2[:len(edje2)-1]==node:
if edje1 in graph:
graph[edje1].append(edje2)
else:
graph[edje1]=[edje2]
for val in graph.values():
val.sort()
for k,v in sorted(graph.items()):
if len(v)<1:
continue
else:
line=k+' -> '+','.join(v)+'\n'
print (line)
I think you make the algorithm much too complicated: you can simply first perform a uniqueness filter on the input:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
And then iterate over this list of "edges". For each edge, the first three characters is the from node, the last three characters are the to node:
for edge in edges:
frm = edge[:len(edge)-1]
to = edge[1:]
#...
Now you simply need to add this to your graph:
for edge in edges:
frm = edge[:len(edge)-1]
to = edge[1:]
if frm in graph:
graph[frm].append(to)
else:
graph[frm]=[to]
And finally perform a sorting and printing like you did yourself:
for val in graph.values():
val.sort()
for k,v in sorted(graph.items()):
print(k+' -> '+','.join(v))
This results in:
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
As you can see, there is a small difference on line 2
: there your expected output contains AGG
two times, which makes not much sense.
So the full algorithm is something like:
inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
edges=list(set(inp))
graph={}
for edge in edges:
frm = edge[:len(edge)-1]
to = edge[1:]
if frm in graph:
graph[frm].append(to)
else:
graph[frm]=[to]
for val in graph.values():
val.sort()
for k,v in sorted(graph.items()):
print(k+' -> '+','.join(v))
A problem I think, is that you consider three letter sequences to be "edjes" (probably edges). The edges are the four sequence characters. By performing this conversion, information is lost. Next you construct a set of two-character items (nodes
which are no nodes at all). They seem to be used to "glue" the nodes together. But at that stage, you do not longer know how the pieces are glued together anyway.