Search code examples
pythonpermutationglobdna-sequence

Looking for elegant glob-like DNA string expansion


I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.

The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.

For example, say I have the string:

ATMM

I would like to take this string as input and output the four possible matching strings:

ATAA ATAC ATCA ATCC

Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.

Thank you for any advice.

Edit, thanks cortex for the product operator. This is my solution:

Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.

import sys
from itertools import product

baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                  Y=['C','T'],K=['G','T'],V=['A','C','G'],
                  H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]

    ## this loop visits very possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base 
                                 else [x] for x in string]))
    return strings

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x

Solution

  • Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):

    from itertools import product
    map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))
    

    Full solution with input handling:

    import sys
    from itertools import product
    
    base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
                  "S":['C','G'], "Y":['C','T'], "K":['G','T'],
    
                  "V":['A','C','G'], "H":['A','C','T'],
                  "D":['A','G','T'], "B":['C','G','T'],
                  }
    
    def base_glob(glob_sequence):
        production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
        return map("".join, product(*production_sequence))
    
    for line in sys.stdin.readlines():
        productions = base_glob(line.strip())
        print "\n".join(productions)