python, pandas, dataframe, group-by, aggregation

How to aggregate 3 columns in a DataFrame to get the count and distribution of values in separate columns in Python Pandas?


I have a Pandas DataFrame like the one below:

data types:

  • ID - int

  • TIME - int

  • TG - int

    ID TIME TG
    111 20210101 0
    111 20210201 0
    111 20210301 1
    222 20210101 0
    222 20210201 1
    333 20210201 1
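
For reproducibility, the sample above can be built as follows. This is only a sketch: the variable name `df` is an assumption (it matches the name used in the solution), and the values are copied from the table above.

```python
import pandas as pd

# Sample data copied from the table above; the name `df` is an assumption
df = pd.DataFrame({
    "ID":   [111, 111, 111, 222, 222, 333],
    "TIME": [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    "TG":   [0, 0, 1, 0, 1, 1],
})

# All three columns should come out as integer dtypes
print(df.dtypes)
```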

I need to aggregate the DataFrame above to find out:

  1. how many IDs there are for each value in TIME
  2. how many "1" values in TG there are for each value in TIME
  3. how many "0" values in TG there are for each value in TIME

So I need something like below:

TIME     | num_ID | num_1 | num_0
---------|--------|-------|--------
20210101 | 2      | 0     | 2
20210201 | 3      | 2     | 1
20210301 | 1      | 1     | 0

How can I do that in Python Pandas?


Solution

  • Use GroupBy.size to count the rows for each TIME value, joined with crosstab to count the number of 0 and 1 values:

    df1 = (df.groupby('TIME').size().to_frame('num_ID')
             .join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
             .reset_index())
    print (df1)
           TIME  num_ID  num_0  num_1
    0  20210101       2      2      0
    1  20210201       3      1      2
    2  20210301       1      0      1
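
To make the pieces of this approach easier to follow, here is a self-contained sketch that splits it into steps. The sample data is copied from the question; the intermediate names `counts`, `tg_counts` and `result` are illustrative assumptions, and the final column reordering is added only to match the layout requested in the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID":   [111, 111, 111, 222, 222, 333],
    "TIME": [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    "TG":   [0, 0, 1, 0, 1, 1],
})

# Step 1: number of rows for each TIME value
counts = df.groupby("TIME").size().to_frame("num_ID")

# Step 2: frequency table of TG per TIME; the columns are the TG values 0 and 1,
# which add_prefix turns into num_0 and num_1
tg_counts = pd.crosstab(df["TIME"], df["TG"]).add_prefix("num_")

# Step 3: both frames are indexed by TIME, so join aligns them on that index
result = counts.join(tg_counts).reset_index()

# Optional: reorder columns to match the layout asked for in the question
result = result[["TIME", "num_ID", "num_1", "num_0"]]
print(result)
```

Note that `size` counts rows, so `num_ID` equals `num_0 + num_1` only as long as TG contains nothing but 0 and 1.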
    

    Another idea, if you only need to count the 0 and 1 values, is to use GroupBy.agg:

    df1 = (df.assign(num_0 = df['TG'].eq(0),
                    num_1 = df['TG'].eq(1))
            .groupby('TIME').agg(num_ID = ('TG','size'),
                                 num_1=('num_1','sum'),
                                 num_0=('num_0','sum'),
                                 )
            .reset_index()
            )
    print (df1)
           TIME  num_ID  num_1  num_0
    0  20210101       2      0      2
    1  20210201       3      2      1
    2  20210301       1      1      0
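
One caveat: both approaches count rows per TIME value, which matches the number of IDs only when each ID appears at most once per TIME. If distinct IDs are what is needed, `nunique` can be used instead of `size`. The sketch below assumes that interpretation; the duplicated row and the names `df_dup`, `out` and `num_rows` are hypothetical and not part of the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID":   [111, 111, 111, 222, 222, 333],
    "TIME": [20210101, 20210201, 20210301, 20210101, 20210201, 20210201],
    "TG":   [0, 0, 1, 0, 1, 1],
})

# Hypothetical extra row (not in the question): ID 111 appears twice for 20210101
extra = pd.DataFrame({"ID": [111], "TIME": [20210101], "TG": [0]})
df_dup = pd.concat([df, extra], ignore_index=True)

out = (df_dup.assign(num_1=df_dup["TG"].eq(1),
                     num_0=df_dup["TG"].eq(0))
             .groupby("TIME")
             .agg(num_rows=("ID", "size"),     # rows per TIME, duplicates included
                  num_ID=("ID", "nunique"),    # distinct IDs per TIME
                  num_1=("num_1", "sum"),
                  num_0=("num_0", "sum"))
             .reset_index())
print(out)
```

With the duplicated row, `num_rows` for 20210101 is 3 while `num_ID` stays at 2, which shows the difference between the two counts.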