python numpy python-itertools

How to compute the frequency of each unique letter pair for every possible pair of columns in a NumPy matrix in Python


I have a NumPy matrix like this:

>>> print matrix
[['L' 'G' 'T' 'G' 'A' 'P' 'V' 'I']
 ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
 ['A' 'A' 'S' 'G' 'P' 'S' 'S' 'G']
 ['G' 'L' 'T' 'G' 'A' 'P' 'V' 'I']]

For EVERY POSSIBLE pair of columns, I would like to retrieve the frequency of each unique pair of letters, where a pair is formed row by row from the letters in those two columns.

For instance, for the first pair of columns, that is:

[['L' 'G']
 ['A' 'A']
 ['A' 'A']
 ['G' 'L']]

I would like to retrieve the frequency of every pair of letters within that pair of columns (NOTE: the order of the letters matters):

Frequency of ['L' 'G'] = 1/4

Frequency of ['A' 'A'] = 2/4

Frequency of ['G' 'L'] = 1/4

Once the frequencies for the first pair of columns are calculated, do the same for every other possible combination of two columns.

I think something from itertools would help to solve this, but I don't know how. Any help would be greatly appreciated.


Solution

  • I'd use itertools.combinations and collections.Counter:

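    import collections
    import itertools

    # s is the character matrix from the question (a 2-D numpy array)
    for i, j in itertools.combinations(range(s.shape[1]), 2):
        c = s[:, [i, j]]  # the two selected columns, one letter pair per row
        counts = collections.Counter(map(tuple, c))
        print 'columns {} and {}'.format(i, j)
        for k in sorted(counts):
            print 'Frequency of {} = {}/{}'.format(k, counts[k], len(c))
        print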
    

    produces

    columns 0 and 1
    Frequency of ('A', 'A') = 2/4
    Frequency of ('G', 'L') = 1/4
    Frequency of ('L', 'G') = 1/4
    
    columns 0 and 2
    Frequency of ('A', 'S') = 2/4
    Frequency of ('G', 'T') = 1/4
    Frequency of ('L', 'T') = 1/4
    
    [...]
    

    (Modifying it to handle both column orders, 0 1 and 1 0, is trivial if you want them: for example, swap itertools.combinations for itertools.permutations. I've also assumed that by "every possible pair of columns" you don't mean only "every adjacent pair of columns".)
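
For anyone on Python 3, here is a minimal sketch of the same idea wrapped in a function that returns the frequencies instead of printing them. The function name pair_frequencies and its ordered flag are my own, not part of the original answer, and I rebuild the example matrix from the question as a plain 2-D numpy array of single characters. It reports frequencies as floats (0.5 rather than 2/4), which is another choice of mine.

```python
import itertools
from collections import Counter

import numpy as np


def pair_frequencies(matrix, ordered=False):
    """Map each column pair (i, j) to {(letter_i, letter_j): frequency}.

    Hypothetical helper: with ordered=False each column pair appears once
    (i < j); with ordered=True both (i, j) and (j, i) are included.
    """
    arr = np.asarray(matrix)
    n_rows, n_cols = arr.shape
    rows = arr.tolist()  # plain Python strings, one list per row
    pick = itertools.permutations if ordered else itertools.combinations
    result = {}
    for i, j in pick(range(n_cols), 2):
        # Count each (letter in column i, letter in column j) pair, row by row
        counts = Counter((row[i], row[j]) for row in rows)
        result[(i, j)] = {pair: n / n_rows for pair, n in counts.items()}
    return result


# The matrix from the question
s = np.array([
    list("LGTGAPVI"),
    list("AASGPSSG"),
    list("AASGPSSG"),
    list("GLTGAPVI"),
])

freqs = pair_frequencies(s)
for pair, freq in sorted(freqs[(0, 1)].items()):
    print("Frequency of {} = {}".format(pair, freq))
```

Converting with tolist() first means the tuple keys hold ordinary Python strings rather than numpy scalar types, so they print cleanly regardless of the numpy version.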