I am trying to generate a list of all possible DNA sequences of length four from the four characters A, T, C, and G. There are 4^4 (256) different sequences in total. Repeats are included, so AAAA is allowed.
I have looked at itertools.combinations_with_replacement(iterable, r); however, the output changes depending on the order of the input string, i.e.:
itertools.combinations_with_replacement('ATCG', 4) #diff results to...
itertools.combinations_with_replacement('ATGC', 4)
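The two calls can be compared directly. The sketch below (variable names are illustrative, not from the original post) shows that each call yields only 35 tuples, and that a given tuple can appear for one input order but not the other, because elements are emitted in the order they occur in the input:

```python
import itertools

# Each call yields the multisets of size 4 drawn from 4 letters: 35 tuples, not 256.
combos_atcg = set(itertools.combinations_with_replacement('ATCG', 4))
combos_atgc = set(itertools.combinations_with_replacement('ATGC', 4))

print(len(combos_atcg), len(combos_atgc))   # 35 35

# Tuples follow the input order, so C-before-G only appears for 'ATCG'.
print(('A', 'A', 'C', 'G') in combos_atcg)  # True
print(('A', 'A', 'C', 'G') in combos_atgc)  # False (it is ('A', 'A', 'G', 'C') there)
```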
Because of this, I attempted to combine itertools.combinations_with_replacement(iterable, r) with itertools.permutations(), passing the output of itertools.permutations() to itertools.combinations_with_replacement(), as defined below:
import itertools

def allCombinations(s, strings):
    # Try every ordering of the input characters
    perms = list(itertools.permutations(s, 4))
    allCombos = []
    for perm in perms:
        combo = list(itertools.combinations_with_replacement(perm, 4))
        allCombos.append(combo)
    # Join each tuple into a string and collect it
    for combos in allCombos:
        for tup in combos:
            strings.append("".join(str(x) for x in tup))
However, running allCombinations('ATCG', li) with li = [] and then taking list(set(li)) still produces only 136 unique sequences, rather than 256.
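One likely explanation for the 136: in every tuple that combinations_with_replacement yields, equal letters are adjacent, so a sequence such as ATAT cannot appear for any input order. The sketch below reproduces the count with a hypothetical helper, grouped_sequences, which is an assumption for illustration and not part of the original code:

```python
import itertools

def grouped_sequences(alphabet, length):
    """Hypothetical helper: union of combinations_with_replacement
    over every ordering of the alphabet."""
    found = set()
    for perm in itertools.permutations(alphabet):
        for tup in itertools.combinations_with_replacement(perm, length):
            found.add(''.join(tup))
    return found

reachable = grouped_sequences('ATCG', 4)

print(len(reachable))       # 136
print('AATT' in reachable)  # True: equal letters are adjacent
print('ATAT' in reachable)  # False: the two A's are separated by a T
```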
There must be an easy way to do this, maybe generating a power set and then filtering for length 4?
You can achieve this by using itertools.product. It gives the Cartesian product of the passed iterables:
import itertools

a = 'ACTG'
print(len(list(itertools.product(a, a, a, a))))
# or even better, as @ayhan commented:
# print(len(list(itertools.product(a, repeat=4))))
>> 256
But it returns tuples, so if you are looking for strings:
for output in itertools.product(a, repeat=4):
    print(''.join(output))
>> AAAA
AAAC
.
.
GGGG
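If a list of strings is wanted rather than printed output, the same idea can be wrapped in a small function. This is a minimal sketch; the name all_sequences and its default arguments are assumptions for illustration:

```python
import itertools

def all_sequences(alphabet='ACTG', length=4):
    """Return every string of the given length over alphabet (hypothetical helper)."""
    return [''.join(chars) for chars in itertools.product(alphabet, repeat=length)]

seqs = all_sequences()
print(len(seqs))          # 256
print(len(set(seqs)))     # 256, so there are no duplicates
print(seqs[0], seqs[-1])  # AAAA GGGG
```

Because product iterates in the order of the input string, changing the order of the alphabet changes only the order of the results, not which sequences are produced.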