Search code examples
pythonpandasgroup-byconditional-formatting

groupby() with condition - count / mean


im having the dataframe:

test = pd.DataFrame({'Date': [2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 30, 2020 - 12 - 31, 2020 - 12 - 31, 2021 - 0o1 - 0o1, 2021 - 0o1 - 0o1], 'label': ['Positive', 'Positive', 'Negative', 'Negative','Negative', 'Positive', 'Positive'], 'score': [70, 80, 50, 50, 30, 90, 70]})

Output:

   Date         label       score
2020-12-30      Positive    70
2020-12-30      Positive    80
2020-12-30      Negative    50
2020-12-31      Negative    50
2020-12-31      Negative    30
2021-01-01      Positive    90
2021-01-01      Positive    70

My goal is to group by the date and count the labels. In addition the score should be calculating only the mean with the labels/score which are higher at that day. For example if there is more positive than negative at the day it should calculate the mean of the positive scores without the negative scores and the other way around.

I though about the function new_df = test.groupby(['Date', 'label']).agg({'label' : 'count', 'score' : 'mean'})

Output should be like that. Maybe .pivot function would help?

   Date          label     new_score   count_pos  count_neg 
2020-12-30      Positive      150         2          1        
2020-12-31      Negative      80          0          2       
2021-01-01      Positive      160         2          0


new_score = 70 + 80 = 150 of the two days with positive label
count_pos = count "Positive" at day X
count_neg = count "Negative" at day X

  

Im a beginner in python and any help or hints how to tackle this task is appreciated!

Thanks!


Solution

  • Another possible solution, which is based on the following ideas:

    1. Doing a pivot table indexing only on Date and aggregating with sum and count.

    2. Using the pivot table to construct a dataframe with the wanted result.

    aux = df.pivot_table(index=['Date'], columns='label', values='score', aggfunc=[
                         'sum', 'count'], fill_value=0)
    
    
    (pd.DataFrame({'label': aux['sum'].idxmax(axis=1), 
                  'score': aux['sum'].max(axis=1),
                  'count_pos': aux['count']['Positive'],
                  'count_neg': aux['count']['Negative']})
     .reset_index())
    
    

    Output:

             Date     label  score  count_pos  count_neg
    0  2020-12-30  Positive    150          2          1
    1  2020-12-31  Negative     80          0          2
    2  2021-01-01  Positive    160          2          0