python pandas pandas-groupby

pandas groupby with length of lists


I need to display both user_id and the length of content_id, which is a list object, as columns in a dataframe, but I am struggling to do this with groupby. Please help with the groupby as well as with the question asked at the bottom of this post (how do I get the results along with user_id in a dataframe?).

Dataframe types:

df.dtypes

output:

user_id       object
content_id    object
dtype: object

Sample Data:

    user_id     content_id
0   user_18085  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1   user_16044  [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2   user_13110  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3   user_18909  [cont_3170_2_28]
4   user_15509  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...

Pandas query:

df.groupby('user_id')['content_id'].count().reset_index()

df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))

output:

    user_id     content_id
0   user_10013  1
1   user_10034  1
2   user_10042  1

When I try it without grouping, it works fine, as shown below:

df['content_id'].apply(lambda x: len(x))


0       11
1        9
2       11
3        1

But how do I get the results along with user_id in a dataframe? I want the output in the format below:

user_id   content_id
some xxx  11
some yyy  6
  

Solution

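  • A pandas groupby returns a GroupBy object that works on the rows of each group, not on the contents of each cell, so count() only counts the rows per user_id rather than the elements in each list. As such, it is not possible (without a lot of workarounds) to get the list lengths from the groupby alone. Instead, you simply need to rewrite the column first (as suggested by @ifly6).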

    Using

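    df_agg = df.copy()
    # replace each list with its length
    df_agg['content_id'] = df_agg['content_id'].apply(len)
    # sum the lengths per user; reset_index() turns user_id back into a column
    df_agg = df_agg.groupby('user_id').sum().reset_index()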
    

    will produce one row per user_id holding the total length of that user's content_id lists, which is the result you were trying to get from the groupby.

    For completeness' sake, the same aggregation written as a single groupby call would be

    df.groupby('user_id').agg(lambda x: x.apply(len).sum())
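
To make the difference concrete, here is a minimal, self-contained sketch. The sample data is hypothetical (the ids only mimic the shape of those in the question), and passing as_index=False is just one way to keep user_id as a regular column, not the only one.

```python
import pandas as pd

# Hypothetical sample data; user_a appears twice to show the per-user sum.
df = pd.DataFrame({
    "user_id": ["user_a", "user_b", "user_a", "user_c"],
    "content_id": [["c1", "c2", "c3"], ["c1"], ["c4", "c5"], ["c2", "c6"]],
})

# count() counts the non-null rows in each group, not the list elements.
row_counts = df.groupby("user_id")["content_id"].count().reset_index()

# Turn each list into its length first, then sum the lengths per user.
# as_index=False keeps user_id as a regular column in the result.
lengths = (
    df.assign(content_id=df["content_id"].apply(len))
      .groupby("user_id", as_index=False)["content_id"]
      .sum()
)

print(row_counts)
print(lengths)
```

With this data, row_counts reports 2 for user_a (two rows), while lengths reports 5 (3 + 2 list elements). If every user_id occurs only once, as the output in the question suggests, the groupby is not needed at all and df.assign(content_id=df['content_id'].apply(len)) already gives the desired two columns.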