Search code examples
pythonpython-3.xpandasdataframedata-analysis

Pandas DataFrame find the max after Groupby two columns and get counts


I have a dataframe df as following:

   userId  pageId  tag
0  3122471  e852   18
1  3122471  f3e2   18
2  3122471  7e93   18
3  3122471  2768    6
4  3122471  53d9    6
5  3122471  06d7   15
6  3122471  e31c   15
7  3122471  c6f3    2
8  1234123  fjwe    1
9  1234123  eiae    4
10 1234123  ieha    4

After using df.groupby(['userId', 'tag'])['pageId'].count() to group the data by userId and tag . I will get:

userId   tag
3122471  2      1
         6      2
         15     2
         18     3
1234123   1     1
          4     2

Now I want to find the tag that each user has the most. Just as following:

userId   tag
3122471  18
1234123   4

(Note: if there are multiple tags that has the same count, I want to use a function my_rule to determine which to show)


Solution

  • You could work on aggregated data.

    In [387]: dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()
    
    In [388]: dff
    Out[388]:
        userId  tag  pageId
    0  1234123    1       1
    1  1234123    4       2
    2  3122471    2       1
    3  3122471    6       2
    4  3122471   15       2
    5  3122471   18       3
    
    In [389]: dff.groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()])
    Out[389]:
    userId
    1234123     4
    3122471    18
    dtype: int64