Search code examples
pythondna-sequence

Algorithm to construct DeBruijn graph gives wrong results


I am trying to write some code to construct a DeBruijn graph from a set of kmers (k letter long strings, DNA sequencing reads) in Python, output as a collection of edges, joining the same node to others.

When I run my code on sample input:

['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']

I get:

CAG -> AGG
GAG -> AGG

Instead of:

AGG -> GGG
CAG -> AGG,AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG

Any hint of what I am doing wrong?
Here is the code:

import itertools

inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']

y=[a[1:] for a in inp]
z=[b[:len(b)-1] for b in inp]

y.extend(z)
edjes=list(set(y))

w=[c[1:] for c in edjes]
v=[d[:len(d)-1] for d in edjes]

w.extend(v)

nodes=list(set(w))

graph={}


new=itertools.product(edjes,edjes)

for node in nodes:
    for edj in new:
        edje1,edje2=edj[0],edj[1]
        if edje1[1:]==node and edje2[:len(edje2)-1]==node:
            if edje1 in graph:
                graph[edje1].append(edje2)
            else:
                graph[edje1]=[edje2]

for val in graph.values():
    val.sort()


for k,v in sorted(graph.items()):
    if len(v)<1:
        continue
   else:
        line=k+' -> '+','.join(v)+'\n'
   print (line)

Solution

  • I think you make the algorithm much too complicated: you can simply first perform a uniqueness filter on the input:

    inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
    
    edges=list(set(inp))
    

    And then iterate over this list of "edges". For each edge, the first three characters is the from node, the last three characters are the to node:

    for edge in edges:
        frm = edge[:len(edge)-1]
        to = edge[1:]
        #...
    

    Now you simply need to add this to your graph:

    for edge in edges:
        frm = edge[:len(edge)-1]
        to = edge[1:]
        if frm in graph:
            graph[frm].append(to)
        else:
            graph[frm]=[to]
    

    And finally perform a sorting and printing like you did yourself:

    for val in graph.values():
        val.sort()
    
    for k,v in sorted(graph.items()):
        print(k+' -> '+','.join(v))
    

    This results in:

    AGG -> GGG
    CAG -> AGG
    GAG -> AGG
    GGA -> GAG
    GGG -> GGA,GGG
    

    As you can see, there is a small difference on line 2: there your expected output contains AGG two times, which makes not much sense.

    So the full algorithm is something like:

    inp=['GAGG','CAGG','GGGG','GGGA','CAGG','AGGG','GGAG']
    
    edges=list(set(inp))
    
    graph={}
    
    for edge in edges:
        frm = edge[:len(edge)-1]
        to = edge[1:]
        if frm in graph:
            graph[frm].append(to)
        else:
            graph[frm]=[to]
    
    for val in graph.values():
        val.sort()
    
    for k,v in sorted(graph.items()):
        print(k+' -> '+','.join(v))
    

    Your algorithm

    A problem I think, is that you consider three letter sequences to be "edjes" (probably edges). The edges are the four sequence characters. By performing this conversion, information is lost. Next you construct a set of two-character items (nodes which are no nodes at all). They seem to be used to "glue" the nodes together. But at that stage, you do not longer know how the pieces are glued together anyway.