
All combinations of DNA characters in a string of length 4


I am trying to generate a list of all possible DNA sequences of length four from the four characters A, T, C, and G. There are 4^4 (256) different combinations in total. Repeats are included, so AAAA is allowed. I have looked at itertools.combinations_with_replacement(iterable, r); however, the output changes depending on the order of the input string, i.e.

itertools.combinations_with_replacement('ATCG', 4)  # gives different results from...
itertools.combinations_with_replacement('ATGC', 4)

Because of this, I attempted to combine itertools.combinations_with_replacement(iterable, r) with itertools.permutations(), passing each permutation produced by itertools.permutations() to itertools.combinations_with_replacement(), as defined below:

import itertools

def allCombinations(s, strings):
    # Try every ordering of the input characters
    perms = list(itertools.permutations(s, 4))
    allCombos = []
    for perm in perms:
        combo = list(itertools.combinations_with_replacement(perm, 4))
        allCombos.append(combo)
    # Flatten the results and join each tuple into a string
    for combos in allCombos:
        for tup in combos:
            strings.append("".join(str(x) for x in tup))

However, running allCombinations('ATCG', li), where li = [], and then taking list(set(li)) still produces only 136 unique sequences, rather than 256.

There must be an easier way to do this. Maybe generating a power set and then filtering for length 4?


Solution

  • You can achieve this by using itertools.product. It gives the Cartesian product of the passed iterables:

    import itertools

    a = 'ACTG'

    print(len(list(itertools.product(a, a, a, a))))
    # or even better, print(len(list(itertools.product(a, repeat=4)))) as @ayhan commented
    >> 256
    

    However, it returns tuples, so if you are looking for strings, join each tuple:

    for output in itertools.product(a, repeat=4):
        print(''.join(output))
    
    >> AAAA
       AAAC
       .
       .
       GGGG
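
To see why product fits here and combinations_with_replacement does not, the sketch below collects both results as strings and compares them. The names BASES, LENGTH, sequences and multisets are only illustrative and do not come from the question. The point it tries to show is that combinations_with_replacement treats each result as an unordered selection, so it emits a single arrangement per selection of letters, whereas product emits every ordered arrangement.

```python
import itertools

BASES = "ACGT"  # illustrative name; any ordering of the four letters works
LENGTH = 4      # illustrative name for the sequence length

# product(..., repeat=n) yields every ordered n-tuple, repeats included.
sequences = ["".join(t) for t in itertools.product(BASES, repeat=LENGTH)]

# combinations_with_replacement yields one tuple per unordered selection,
# always arranged in the order the letters appear in the input.
multisets = [
    "".join(t)
    for t in itertools.combinations_with_replacement(BASES, LENGTH)
]

print(len(sequences), len(set(sequences)))  # 256 256
print(len(multisets))                       # 35
print("ATAT" in sequences, "ATAT" in multisets)  # True False
```

Because product already yields each ordered sequence exactly once, no set() de-duplication should be needed, and changing the order of the input string only changes the order of the output, not its contents.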