Search code examples
pythonpandaswindow-functions

How to simulate windows fuction with partition by in pandas?


I got this data with Nulls in original_eur column.

event_id category rounds_bot_date original_eur
0 1 Category 1 2024-03-25 00:00:00 200
1 1 Category 1 2024-03-25 00:00:00 nan
2 2 Category 2 2024-03-25 00:00:00 nan
3 2 Category 2 2024-03-25 00:00:00 150
4 2 Category 2 2024-03-25 00:00:00 150
5 2 Category 1 2024-03-25 00:00:00 nan
6 3 Category 3 2024-03-25 00:00:00 nan
7 3 Category 2 2024-03-25 00:00:00 150
8 3 Category 3 2024-03-25 00:00:00 60
9 3 Category 2 2024-03-25 00:00:00 150

I need replace each null in columns with median value for appropriate event_id, category, rounds_bot_date.

With SQL I can use case + median window function:

case
    when original_eur = NaN
    then median(original_eur) over(partition by event_id, category, rounds_bot_date)
    else original_eur
end as original_eur

For Pandas I make table with medians:

median_table = (
    dataset
    .groupby(['event_id', 'category', 'rounds_bot_date'])
    .agg(original_eur_median = ('original_eur', 'median'))
    .reset_index()
)

And apply this fuction to dataset:

def fill_na(value, event_id, category, rounds_bot_date, median_table: pd.DataFrame):
    if math.isnan(value):
        value = (
            median_table[
                (median_table['event_id'] == event_id) & 
                (median_table['category'] == category) & 
                (median_table['rounds_bot_date'] == rounds_bot_date)]['original_eur_median'].values[0]
        )
        return value
    else:
        return value


dataset['original_eur'] = (
    dataset
    .apply(
        lambda x: fill_na(x['original_eur'], x['event_id'], x['category'], x['rounds_bot_date'], median_table),
        axis = 1)
)

Is there any way to optimize this code and simulate median window function in Pandas?

P. S. I an make iterrows with same logic, but it's not fast as SQL-function.

Solution:

# add new column with median values
dataset['original_eur_median'] = (
    dataset
    .groupby(['event_id', 'category', 'rounds_bot_date'])['original_eur']
    .transform('median')
)
# fill NaN with median values.
dataset['original_eur'] = dataset['original_eur'].fillna(dataset['original_eur_median'])

Solution

  • Since ur code is kinda complicated and there is no ideal output shown, I guess ur task can be done using pandas.DataFrame.groupby().transform(). Try this:

    df['fixed_eur'] = df.groupby(['event_id', 'category', 'rounds_bot_date'])['original_eur'].transform(lambda x: x.fillna(x.median()))
    

    Note: When using Pandas, Numpy, Polars and alike Python libs, keep in mind that:

    1. The 99.99% of our problems have been faced and solved by other peoples before, so check the API reference first, where u usually could find the ready answer;
    2. Self-made functions always perform much worse than built-in ones, so plz avoid them. If u have to, at least avoid loops in them.
    3. If u r familiar with SQL, I recommend Polars other than Pandas for u. I believe u will quickly get used to it.