I have a dataframe where based on the occurence of a value in column 'hit_num', I need to get the mean value of the column 'scores'. Below is an example of random data I created for this question:
# set random seed for reproducibility
np.random.seed(42)
# crate random pandas dataframe
df = pd.DataFrame({'hit_num': np.random.randint(1,5,20),
'score': np.random.randint(1, 500, 20),
})
df.sort_values(by='hit_num')
Thus, if a value occurs at least three times in the hit_num column. I need to create a new dataframe where with the hit_num value and the average of the scores. examlpe of the data I generated to create an idea Thus hit_num 1 should be skipped, hit_num 2 should get mean of the 4 scores and hit_num 3 should get the mean of 7 scores.
I tried to work with np.mean() or with pandas mean, but I realized that it would just give the mean of the whole column. Next, I searched for column/row specific. I found something here on stackoverflow that showed how to get the mean of specific column/row combinations:
df[['b', 'c']].iloc[[2,4]].mean(axis=0)
but this was for a pre-set rowcolumn combination and I need it randomly to work. I therefore thought I need something like a loop to first state hit_num > 3 and next the mean()? However I got stuck here (I still am not sure if I should even do it like this):
for hits in df:
if df['hit_num'].value_counts() < 3:
continue
elif df['hit_num'].value_counts() => 3:
df.mean()
But I cannot figure out how to say take the mean of the 'score' columns, but only those with hit_num 2 for example.
import numpy as np
import pandas as pd
# Set random seed for reproducibility
np.random.seed(42)
# Create random pandas DataFrame
df = pd.DataFrame({'hit_num': np.random.randint(1, 5, 20),
'score': np.random.randint(1, 500, 20)})
# Group by 'hit_num', filter for those occurring at least 3 times, and calculate mean of 'score'
result = df.groupby('hit_num').filter(lambda x: len(x) >= 3).groupby('hit_num')['score'].mean().reset_index()
print(result)