Search code examples
pythonalgorithmoptimizationdna-sequence

Improving code design of DNA alignment degapping


This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
align = {"seq1":"ATGCATGC", 
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}

Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions/examples for a shorter, more direct code design?


Solution

  • This answer handles your updated annos dictionary from the comment to cdlanes answer. That answer leaves the annos dictionary with the incorrect index of [2,1] for seq2 gene2. My proposed solution will remove the gene entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align, then anno[geneX] will have equal indices for start and stop --> See seq3 gene1 from your commented annos.

    align = {"seq1":"ATGCATGC",
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
    
    annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
             "seq2":{"gene1":[0,3], "gene2":[4,7]},
             "seq3":{"gene1":[0,3], "gene2":[4,7]}}
    
    
    annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
              "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
              "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
    
    import re
    for name,anno in annos.items():
        # indices of gaps removed usinig re
        removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]
    
        # removes gaps from align dictionary
        align[name] = re.sub(r'-', '', align[name])
    
        build_dna = ''
        for gene,inds in anno.items():
    
            start_ind = len(build_dna)+1
    
            #generator to sum the num '-' removed from gene
            num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])
    
            # build the de-gapped string
            build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")
    
            end_ind = len(build_dna)
    
            if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
                del annos[name][gene] #remove the gene entry
                continue
            #update the values in the annos dictionary
            annos[name][gene][0] = start_ind-1
            annos[name][gene][1] = end_ind-1
    

    Results:

    In [3]: annos
    Out[3]:  {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
              'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
              'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
    

    Results from the 3 gene annos above. Just replace the annos variable:

    In [5]: annos3
    Out[5]:  {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
              'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
              'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}