Search code examples
pythonmean

How to obtain the mean value when the value occurs at least triplicate in the data? (Python)


I have a dataframe where based on the occurence of a value in column 'hit_num', I need to get the mean value of the column 'scores'. Below is an example of random data I created for this question:

# set random seed for reproducibility
np.random.seed(42)

# crate random pandas dataframe
df = pd.DataFrame({'hit_num': np.random.randint(1,5,20), 
                   'score': np.random.randint(1, 500, 20),
                   })   
df.sort_values(by='hit_num')

Thus, if a value occurs at least three times in the hit_num column. I need to create a new dataframe where with the hit_num value and the average of the scores. examlpe of the data I generated to create an idea Thus hit_num 1 should be skipped, hit_num 2 should get mean of the 4 scores and hit_num 3 should get the mean of 7 scores.

I tried to work with np.mean() or with pandas mean, but I realized that it would just give the mean of the whole column. Next, I searched for column/row specific. I found something here on stackoverflow that showed how to get the mean of specific column/row combinations:

df[['b', 'c']].iloc[[2,4]].mean(axis=0)

but this was for a pre-set rowcolumn combination and I need it randomly to work. I therefore thought I need something like a loop to first state hit_num > 3 and next the mean()? However I got stuck here (I still am not sure if I should even do it like this):

for hits in df:
        if df['hit_num'].value_counts() < 3:
            continue
        elif df['hit_num'].value_counts() => 3:
            df.mean()

But I cannot figure out how to say take the mean of the 'score' columns, but only those with hit_num 2 for example.


Solution

  • import numpy as np
    import pandas as pd
    
    # Set random seed for reproducibility
    np.random.seed(42)
    
    # Create random pandas DataFrame
    df = pd.DataFrame({'hit_num': np.random.randint(1, 5, 20), 
                       'score': np.random.randint(1, 500, 20)})
    
    # Group by 'hit_num', filter for those occurring at least 3 times, and calculate mean of 'score'
    result = df.groupby('hit_num').filter(lambda x: len(x) >= 3).groupby('hit_num')['score'].mean().reset_index()
    
    print(result)