
Rearrange strings in list alphabetically and by case


I build a list in a for loop using itertools.product() to find the different combinations of letters. I want to use collections.Counter() to count how many times each combination occurs. Right now, however, it prints every ordering of the A's and G's separately:

['a', 'A', 'G', 'G']
['a', 'A', 'G', 'g']
['a', 'A', 'G', 'G']
['a', 'A', 'G', 'g']
['a', 'A', 'G', 'g']
#...
['a', 'G', 'A', 'G']
['a', 'G', 'a', 'g']
['a', 'G', 'A', 'G']
['a', 'G', 'a', 'G']
['a', 'G', 'a', 'G']
#...
['a', 'G', 'a', 'G']
['a', 'G', 'A', 'G']
['a', 'G', 'a', 'g']
['a', 'G', 'A', 'G']
['a', 'G', 'a', 'G']
#...
['a', 'G', 'A', 'G']
['a', 'G', 'a', 'G']
['a', 'G', 'a', 'G']
# etc.

Now, this isn't all of them, but as you can see, there are some occurrences that are the same although ordered differently, for example:

['a', 'G', 'A', 'G']
['a', 'A', 'G', 'G']

I would much prefer the latter ordering, so I want to find a way to print all of the combinations with the letters in alphabetical order ('a' before 'g') and, within each letter, uppercase before lowercase. The final result should look like ['AaGG', 'aaGg', ...]. What function or functions should I use?

This is the code that generates the data. The section marked "Counting" is what I'm having trouble with.

import itertools
from collections import Counter
parent1 = 'aaGG'
parent2 = 'AaGg'
f1 = []
f1_ = []
genotypes = []
b = []
genetics = []
g = []
idx = []

parent1 = list(itertools.combinations(parent1, 2))    
del parent1[0]
del parent1[4] 

parent2 = list(itertools.combinations(parent2, 2))    
del parent2[0]
del parent2[4]


for x in parent1:
    f1.append(''.join(x))

for x in parent2:
    f1_.append(''.join(x))

y = list(itertools.product(f1, f1_))  

for x in y:
    genotypes.append(''.join(x))
    break
genotypes = [
        thingies[0][0] + thingies[1][0] + thingies[0][1] + thingies[1][1]
        for thingies in zip(parent1, parent2)
] * 4
print 'F1', Counter(genotypes)

# Counting
for genotype in genotypes:
    alleles = list(itertools.combinations(genotype,2))
    del alleles[1]
    del alleles[3]
    for x in alleles:
        g.append(''.join(x))

for idx in g:
    if idx.lower().count("a") == idx.lower().count("g") == 1:
        break                

f2 = list(itertools.product(g, g)) 

for x in f2:
    genetics.append(''.join(x)) 

for genes in genetics:
    if genes.lower().count("a") == genes.lower().count("g") == 2:
        genes = ''.join(genes)
    print Counter(genes)

Solution

  • I think you're looking for a custom sort order. A plain sort compares characters by ASCII code, which places every uppercase letter before every lowercase one, so 'G' would come before 'a'. I would define the precedence explicitly with a dictionary:

    >>> test_list = ['a', 'A', 'g', 'G']
    >>> precedence_dict = {'A':0, 'a':1, 'G':2,'g':3}
    >>> test_list.sort(key=lambda x: precedence_dict[x])
    >>> test_list
    ['A', 'a', 'G', 'g']
    

    Edit: Your last few lines:

    for genes in genetics:
        if genes.lower().count("a") == genes.lower().count("g") == 2:
            genes = ''.join(genes)
        print Counter(genes)
    

    were not doing what you wanted them to.

    Replace those lines with:

    precedence_dict = {'A':0, 'a':1, 'G':2,'g':3}
    
    # Put each genotype's alleles in canonical order, writing the result back by index
    for i in xrange(len(genetics)):
        genetics[i] = ''.join(sorted(genetics[i], key=lambda x: precedence_dict[x]))
    
    # Drop duplicates with the built-in set (the sets module is deprecated), then sort
    genetics = sorted(set(genetics))
    
    print genetics
    

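    and I think you have the correct solution. In a for loop, the loop variable is just a name bound to each item in turn; assigning a new value to 'genes' rebinds that name and leaves the list untouched (strings are immutable, so they can't be changed in place either). That is why the original list was never modified, and why the replacement writes back through genetics[i].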
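
Since the question was about counting with collections.Counter(), here is a minimal sketch of how the normalization and the tally might fit together. It is an illustration, not the asker's code: the names `PRECEDENCE` and `normalize` are hypothetical, the sample `genetics` list is made up, and the print calls are parenthesized so the snippet should run on both Python 2 and Python 3.

```python
from collections import Counter

# Hypothetical precedence table, using the same ordering as the answer above.
PRECEDENCE = {'A': 0, 'a': 1, 'G': 2, 'g': 3}


def normalize(genotype):
    """Return the genotype with its alleles in canonical order, e.g. 'aGAG' -> 'AaGG'."""
    return ''.join(sorted(genotype, key=PRECEDENCE.__getitem__))


# Made-up sample data standing in for the question's `genetics` list.
genetics = ['aGAG', 'aAGG', 'aGag', 'GaGa']

# Rebinding the loop variable only changes what the name `genes` refers to;
# the list itself is left untouched.
for genes in genetics:
    genes = normalize(genes)
print(genetics)  # still the original, unsorted strings

# Building a new list (or assigning by index) is what actually stores the result.
normalized = [normalize(genes) for genes in genetics]

# Counter tallies whole genotype strings, so orderings of the same alleles merge.
counts = Counter(normalized)
print(counts)
```

Passing the list of normalized strings to Counter tallies genotypes, whereas Counter(genes) on a single string counts its individual characters, which is probably not what the question intended.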