python · pandas · group-by

GroupBy, Aggregate Strings and return unique values


How to add a new column of aggregated data

I want to create three new columns in a dataframe:

Column 01: unique_list

A new column containing the unique cfop_code values for each key

Column 02: unique_count

A column with the number of unique values that appear in unique_list

Column 03: not_unique_count

A column with the total number of cfop_code values for each key (the group size, counting duplicates)

example_df

                    key    product  cfop_code
0  12345678901234567890  product a       2551
1  12345678901234567890  product b       2551
2  12345678901234567890  product c       3551
3  12345678901234567895  product a       2551
4  12345678901234567897  product b       2551
5  12345678901234567897  product c       2407

Expected Result

                    key    product  cfop_code  unique_list  unique_count  not_unique_count
0  12345678901234567890  product a       2551   2551, 3551             2                 3
1  12345678901234567890  product b       2551   2551, 3551             2                 3
2  12345678901234567890  product c       3551   2551, 3551             2                 3
3  12345678901234567895  product a       2551         2551             1                 1
4  12345678901234567897  product b       2551   2407, 2551             2                 2
5  12345678901234567897  product c       2407   2407, 2551             2                 2

What I tried

Creating a list of unique values

df.groupby('key')["cfop_code"].unique()

key
12345678901234567890    [2551, 3551]
12345678901234567895          [2551]
12345678901234567897    [2551, 2407]
Name: cfop_code, dtype: object

Getting the count of non-unique values

df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop', aggfunc="size")}).reset_index()

key unique_values
0   12345678901234567890    3
1   12345678901234567895    1
2   12345678901234567897    2

Getting the count of unique values into a dataframe

df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop', aggfunc="nunique")}).reset_index()

key unique_values
0   12345678901234567890    2
1   12345678901234567895    1
2   12345678901234567897    2

But adding the results as new columns FAILS

df['unique_list'] = df.groupby('key')["cfop_code"].unique()
df['unique_count'] = df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="nunique")}).reset_index()
df['not_unique_count'] = df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="size")}).reset_index()

Solution

  • Try:

    tmp = (
        df.groupby("key")["cfop_code"]
        .agg(
            unique_list = lambda s: sorted(s.unique()), 
            unique_count = "nunique", 
            not_unique_count = "size")
        .reset_index()
    )
    res = df.merge(tmp, on="key")
    
    print(res)
                        key    product  cfop_code   unique_list  unique_count  not_unique_count
    0  12345678901234567890  product a       2551  [2551, 3551]             2                 3
    1  12345678901234567890  product b       2551  [2551, 3551]             2                 3
    2  12345678901234567890  product c       3551  [2551, 3551]             2                 3
    3  12345678901234567895  product a       2551        [2551]             1                 1
    4  12345678901234567897  product b       2551  [2407, 2551]             2                 2
    5  12345678901234567897  product c       2407  [2407, 2551]             2                 2
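
    The merge on key is a many-to-one join, so the per-key aggregates are broadcast back onto every original row and each product line ends up with its key's unique_list, unique_count and not_unique_count.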
    
    

    The problem with your attempt is that:

    df.groupby("key").agg(**{"unique_values": pd.NamedAgg(column='cfop_code', aggfunc="nunique")}).reset_index()
    

    returns a DataFrame. You then try to assign this whole DataFrame to a new column, which fails.
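
    As an alternative to the merge, per-key results can also be broadcast back onto the rows with groupby(...).transform(...), which returns a Series aligned with the original index and can therefore be assigned directly as a column. A minimal sketch, using the same column names as example_df:

    # counts: transform returns one value per original row
    df['unique_count'] = df.groupby('key')['cfop_code'].transform('nunique')
    df['not_unique_count'] = df.groupby('key')['cfop_code'].transform('size')

    # lists: build one sorted list per key, then map it back onto each row
    unique_per_key = df.groupby('key')['cfop_code'].agg(lambda s: sorted(s.unique()))
    df['unique_list'] = df['key'].map(unique_per_key)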