
Frequency count using itertools.groupby() with recarray


The code goes something like this:

>>> import pandas as pd
>>> data = pd.DataFrame({'P': ['p1', 'p1', 'p2'],
...                      'Q': ['q1', 'q2', 'q1'],
...                      'R': ['r1', 'r1', 'r2']})

>>> data

  P  Q  R
0 p1 q1 r1
1 p1 q2 r1
2 p2 q1 r2

>>> data.groupby(['R'] + ['P','Q']).size().unstack(['P','Q'])

After reindexing and fillna(0) it gives the following result:

P  p1      p2
Q  q1  q2  q1  q2
R
r1  1   1   0   0
r2  0   0   1   0

I wanted to do the same with recarray so I imported itertools and tried the following:

>>> import numpy as np
>>> from itertools import groupby
>>> data = np.array([('p1', 'p1', 'p2'), ('q1', 'q2', 'q1'), ('r1', 'r1', 'r2')],
...                 dtype=[('P',object),('Q',object),('R',object)]).view(np.recarray)

>>> groupby(data,key = (['R']+['P','Q'])).size().unstack(['P','Q'])

It doesn't work. How do I achieve a similar result without using pandas?


Solution

  • Let's back away from the fancy recarray and object type. It doesn't buy us anything.

    The data can be a simple 2d array of strings:

    In [711]: data = np.array([('p1', 'p1', 'p2'), ('q1', 'q2', 'q1'), ('r1', 'r1', 'r2')])
    In [712]: data
    Out[712]: 
    array([['p1', 'p1', 'p2'],
           ['q1', 'q2', 'q1'],
           ['r1', 'r1', 'r2']], 
          dtype='<U2')
    

    Better yet, make it a list of lists:

    In [713]: data.tolist()
    Out[713]: [['p1', 'p1', 'p2'], ['q1', 'q2', 'q1'], ['r1', 'r1', 'r2']]
    

    itertools.groupby is designed to work with any iterable, such as a list. It can operate on arrays simply because it can iterate over them.

    Explain how you want to group these strings.

    The pandas groupby expression is not self-explanatory.

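    If I simply flatten the data array, I can group runs of consecutive equal values and count them. Note that 'q1' shows up twice in the result because its two occurrences are not adjacent: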

    In [726]: data.ravel()
    Out[726]: 
    array(['p1', 'p1', 'p2', 'q1', 'q2', 'q1', 'r1', 'r1', 'r2'], 
          dtype='<U2')
    In [727]: g=itertools.groupby(data.ravel())
    In [728]: [(k,list(v)) for k,v in g]
    Out[728]: 
    [('p1', ['p1', 'p1']),
     ('p2', ['p2']),
     ('q1', ['q1']),
     ('q2', ['q2']),
     ('q1', ['q1']),
     ('r1', ['r1', 'r1']),
     ('r2', ['r2'])]
    In [729]: g=itertools.groupby(data.ravel())
    In [730]: [(k,len(list(v))) for k,v in g]
    Out[730]: [('p1', 2), ('p2', 1), ('q1', 1), ('q2', 1), ('q1', 1), ('r1', 2), ('r2', 1)]
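
The repeated 'q1' above comes from itertools.groupby starting a new group whenever the value changes. If the goal is a total count per string, one option is to sort before grouping, another is collections.Counter. A minimal sketch, assuming the flattened values from above are held in a plain list (the variable names are only for illustration):

```python
from collections import Counter
from itertools import groupby

# The flattened values, as a plain list.
values = ['p1', 'p1', 'p2', 'q1', 'q2', 'q1', 'r1', 'r1', 'r2']

# groupby starts a new group each time the value changes,
# so the two non-adjacent 'q1' entries end up in separate groups.
run_counts = [(k, len(list(g))) for k, g in groupby(values)]

# Sorting first puts equal values next to each other: one group per value.
sorted_counts = [(k, len(list(g))) for k, g in groupby(sorted(values))]

# Counter tallies occurrences regardless of order.
total_counts = Counter(values)

print(run_counts)
print(sorted_counts)
print(total_counts)
```

Counter is usually the simpler choice when the order of the input does not matter.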
    

    =============

    Extending my answer to work row-wise

    In [738]: grps = [itertools.groupby(row) for row in data]
    In [739]: [[(k, len(list(v))) for k,v in r] for r in grps]
    Out[739]: 
    [[('p1', 2), ('p2', 1)],
     [('q1', 1), ('q2', 1), ('q1', 1)],
     [('r1', 2), ('r2', 1)]]
    

    This works for the object recarray version of data as well.

    Oops - I misunderstood your 'row-wise' description. Even rereading your last comment I don't understand what you want. It doesn't sound like an itertools.groupby problem at all. I thought you were counting strings like 'r1' and 'q2'. Apparently that's not the case.

    ====================

    OK, a more focused attempt to recreate the pandas table

    Use itertools.product to generate all 8 (R, P, Q) combinations of these 6 strings:

    In [847]: pos = list(itertools.product(['r1','r2'],['p1','p2'],['q1','q2']))
    In [848]: pos
    Out[848]: 
    [('r1', 'p1', 'q1'),
     ('r1', 'p1', 'q2'),
     ('r1', 'p2', 'q1'),
     ('r1', 'p2', 'q2'),
     ('r2', 'p1', 'q1'),
     ('r2', 'p1', 'q2'),
     ('r2', 'p2', 'q1'),
     ('r2', 'p2', 'q2')]
    

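    Convert the DataFrame from the question to a list of lists, reordering the columns to R, P, Q (here data is the pandas DataFrame again, not the array above):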

    In [849]: val=data.values[:,[2,0,1]].tolist()
    In [850]: val
    Out[850]: [['r1', 'p1', 'q1'], ['r1', 'p1', 'q2'], ['r2', 'p2', 'q1']]
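
The same list can also be built without pandas, starting from the plain 2-D array at the top of this answer. A minimal sketch, assuming each row of that array holds the values of one column (P, Q, R in that order); the names p, q and r are only for illustration:

```python
import numpy as np

# Rows are the P, Q and R columns of the question's DataFrame.
data = np.array([('p1', 'p1', 'p2'), ('q1', 'q2', 'q1'), ('r1', 'r1', 'r2')])

# Unpack the three rows into plain Python lists.
p, q, r = data.tolist()

# zip pairs up the i-th entry of each column, giving one [R, P, Q] record per observation.
val = [list(t) for t in zip(r, p, q)]

print(val)
```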
    

    Find which of the possible combinations are found in val:

    In [852]: [[i, list(i) in val] for i in pos]
    Out[852]: 
    [[('r1', 'p1', 'q1'), True],
     [('r1', 'p1', 'q2'), True],
     [('r1', 'p2', 'q1'), False],
     [('r1', 'p2', 'q2'), False],
     [('r2', 'p1', 'q1'), False],
     [('r2', 'p1', 'q2'), False],
     [('r2', 'p2', 'q1'), True],
     [('r2', 'p2', 'q2'), False]]
    

    Rework the 'counts' as a 2x4 0/1 array (one row per R value, one column per (P, Q) pair):

    In [853]: np.array([[list(i) in val] for i in pos]).reshape(2,-1).astype(int)
    Out[853]: 
    array([[1, 1, 0, 0],
           [0, 0, 1, 0]])
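
The membership test above only records whether a combination is present, so a combination that occurs twice would still show as 1. For real frequency counts, the steps can be combined with collections.Counter. This is a sketch rather than a definitive answer, and it assumes the recarray holds one record per DataFrame row (fields P, Q, R), which is not the layout used in the question; the names rows, counts, columns and table are only for illustration:

```python
from collections import Counter
from itertools import product

import numpy as np

# Assumption: one record per row of the question's DataFrame (fields P, Q, R).
rows = [('p1', 'q1', 'r1'), ('p1', 'q2', 'r1'), ('p2', 'q1', 'r2')]
data = np.array(rows,
                dtype=[('P', object), ('Q', object), ('R', object)]).view(np.recarray)

# Pull each field out as a plain list of strings.
p_vals = data['P'].tolist()
q_vals = data['Q'].tolist()
r_vals = data['R'].tolist()

# Sorted unique labels for the table's rows and columns.
r_labels = sorted(set(r_vals))
p_labels = sorted(set(p_vals))
q_labels = sorted(set(q_vals))

# Count how often each (R, P, Q) triple occurs.
counts = Counter(zip(r_vals, p_vals, q_vals))

# Every (P, Q) pair becomes a column; a missing triple counts as 0.
columns = list(product(p_labels, q_labels))
table = np.array([[counts[(r, p, q)] for p, q in columns] for r in r_labels])

print(columns)
print(table)
```

Rows of table follow r_labels and columns follow the (P, Q) pairs in columns, which matches the layout of the pandas result after reindexing and fillna(0).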