Search code examples
pandasdataframeaggregatemethod-chaining

How to method chaining in pandas to aggregate a DataFrame?


I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:

Having this data:

import pandas as pd

my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})

my_df
#>      name favorite_color
#> 0    john            red
#> 1   diana           blue
#> 2  rachel          green
#> 3   chris            red

and I want to end up with this summary table:

#>    total_people  total_ppl_who_like_red
#> 0             4                       2

Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:

desired_output_via_building_new_df = pd.DataFrame({
    'total_people': [len(my_df)],
    'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
    })

desired_output_via_building_new_df 
#>    total_people  total_ppl_who_like_red
#> 0             4                       2

However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of

# pseudo-code; not really working
my_df.agg({
    'total_people': lambda x: len(x),
    'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
    })

I can only borrow inspiration from R/dplyr code:

library(dplyr, warn.conflicts = FALSE)

my_df <- 
  data.frame(name = c("john", "diana", "rachel", "chris"),
             favorite_color = c("red", "blue", "green", "red")
             )

my_df |> 
  summarise(total_people = n(),                                     ## in the context of `summarise()`,
            total_ppl_who_like_red = sum(favorite_color == "red"))  ## both `n()` and `sum()` refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#>   total_people total_ppl_who_like_red
#> 1            4                      2

Solution

  • Solution for processing one Series:

    df = my_df.favorite_color.apply({
        'total_people': lambda x: x.count(),
        'total_ppl_who_like_red': lambda x: x.eq('red').sum()
        }).to_frame(name=0).T
    print (df)
       total_people  total_ppl_who_like_red
    0             4                       2
    

    General solution for processing DataFrame with DataFrame.pipe - then pandas processing input DataFrame, if use apply or agg processing columns separately:

    df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x), 
                                         'total_ppl_who_like_red': 
                                            x.favorite_color.eq('red').sum()}))
               .to_frame(name=0).T)
    print (df)
       total_people  total_ppl_who_like_red
    0             4                       2
    

    df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x), 
                               'total_ppl_who_like_red': 
                                   x.favorite_color.eq('red').sum(),
                               'max_age':x.age.max()
                              }).to_frame(name=0).T)
    print (df)
       total_people  total_ppl_who_like_red  max_age
    0             4                       2       41