Chain a sequence of processes in Python


I would like to chain a sequence of operations in Python. One of these steps creates some new variables and then uses the groupby function.

Specifically, I want to create a new DataFrame from my original dataset. I can do it in a few lines, but I would like something more concise using chaining. My original dataset is 'df'. First, I create a new binary variable indicating whether the feature 'var1' has a certain property: NaN or not NaN.

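import math

data = df.copy()  # copy, so the helper columns are not added to df itself
data['aux1'] = data['var1'].map(math.isnan)  # True where var1 is NaN
data['count'] = 1  # helper column of ones to sum per group
pie = data.groupby(['aux1'])['count'].sum()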

In R, I can do something like this:

pie = df %>% select(var1) %>% mutate(aux1 = is.na(var1), count = 1) %>%
          group_by(aux1) %>% summarise(count = sum(count))

Is there a way to chain operations like this in Python?


Solution

  • You can test column var1 for missing values with Series.isna and count the results with Series.value_counts:

    pie = data['var1'].isna().value_counts()
    

    Or create column aux1 with DataFrame.assign and aggregate with GroupBy.size; the helper column of ones is not necessary:

    pie = data.assign(aux1=data['var1'].isna()).groupby('aux1').size()
    

    But it is also possible to create the count column:

    pie = data.assign(aux1=data['var1'].isna(), count=1).groupby('aux1')['count'].sum()
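
The three variants above can be compared side by side on a small example. The following is a minimal sketch, not the only way to write it: the sample values in df are made up for illustration (only the column name var1 comes from the question), and the names pie_counts, pie_size and pie_sum are hypothetical. It also passes a lambda to DataFrame.assign, so the chain refers to the intermediate frame rather than to the outer variable, which is closer to how the R pipeline reads.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name 'var1' comes from the question.
df = pd.DataFrame({"var1": [1.0, np.nan, 3.0, np.nan, 5.0]})

# 1) Shortest form: count missing vs. non-missing values directly.
pie_counts = df["var1"].isna().value_counts()

# 2) Chained form: the lambda receives the DataFrame as it exists at that
#    point in the chain, so no outer variable is referenced inside assign.
pie_size = (
    df[["var1"]]
    .assign(aux1=lambda d: d["var1"].isna())
    .groupby("aux1")
    .size()
)

# 3) Closest to the dplyr pipeline: an explicit helper column summed per group.
pie_sum = (
    df[["var1"]]
    .assign(aux1=lambda d: d["var1"].isna(), count=1)
    .groupby("aux1")["count"]
    .sum()
)

print(pie_sum)
```

For this sample, all three results report 3 non-missing and 2 missing values. Wrapping the chain in parentheses allows one method per line without backslashes.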