I got this data with Nulls in original_eur column.
event_id | category | rounds_bot_date | original_eur | |
---|---|---|---|---|
0 | 1 | Category 1 | 2024-03-25 00:00:00 | 200 |
1 | 1 | Category 1 | 2024-03-25 00:00:00 | nan |
2 | 2 | Category 2 | 2024-03-25 00:00:00 | nan |
3 | 2 | Category 2 | 2024-03-25 00:00:00 | 150 |
4 | 2 | Category 2 | 2024-03-25 00:00:00 | 150 |
5 | 2 | Category 1 | 2024-03-25 00:00:00 | nan |
6 | 3 | Category 3 | 2024-03-25 00:00:00 | nan |
7 | 3 | Category 2 | 2024-03-25 00:00:00 | 150 |
8 | 3 | Category 3 | 2024-03-25 00:00:00 | 60 |
9 | 3 | Category 2 | 2024-03-25 00:00:00 | 150 |
I need replace each null in columns with median value for appropriate event_id, category, rounds_bot_date.
With SQL I can use case + median window function:
case
when original_eur = NaN
then median(original_eur) over(partition by event_id, category, rounds_bot_date)
else original_eur
end as original_eur
For Pandas I make table with medians:
median_table = (
dataset
.groupby(['event_id', 'category', 'rounds_bot_date'])
.agg(original_eur_median = ('original_eur', 'median'))
.reset_index()
)
And apply this fuction to dataset:
def fill_na(value, event_id, category, rounds_bot_date, median_table: pd.DataFrame):
if math.isnan(value):
value = (
median_table[
(median_table['event_id'] == event_id) &
(median_table['category'] == category) &
(median_table['rounds_bot_date'] == rounds_bot_date)]['original_eur_median'].values[0]
)
return value
else:
return value
dataset['original_eur'] = (
dataset
.apply(
lambda x: fill_na(x['original_eur'], x['event_id'], x['category'], x['rounds_bot_date'], median_table),
axis = 1)
)
Is there any way to optimize this code and simulate median window function in Pandas?
P. S. I an make iterrows with same logic, but it's not fast as SQL-function.
Solution:
# add new column with median values
dataset['original_eur_median'] = (
dataset
.groupby(['event_id', 'category', 'rounds_bot_date'])['original_eur']
.transform('median')
)
# fill NaN with median values.
dataset['original_eur'] = dataset['original_eur'].fillna(dataset['original_eur_median'])
Since ur code is kinda complicated and there is no ideal output shown, I guess ur task can be done using pandas.DataFrame.groupby().transform()
. Try this:
df['fixed_eur'] = df.groupby(['event_id', 'category', 'rounds_bot_date'])['original_eur'].transform(lambda x: x.fillna(x.median()))
Note: When using Pandas, Numpy, Polars and alike Python libs, keep in mind that: