Search code examples
sqlsql-serverdatabasemissing-data

How to replace NULL values with Mean value of a category in SQL?


I have a dataset with null values in the column 'revenues_from_appointment'

Dataset

appointment_date patient_id practitioner_id appointment_duration_min revenues_from_appointment
2021-06-28 42734 748 30 90.0
2021-06-29 42737 747 60 150.0
2021-07-01 42737 747 60 NaN
2021-07-03 42736 748 30 60.0
2021-07-03 42735 747 15 42.62
2021-07-04 42734 748 30 NaN
2021-07-05 42734 748 30 100.0
2021-07-10 42738 747 15 50.72
2021-08-12 42739 748 30 73.43

I wish to replace NULL values by the mean value of rows where "patient_id, practitioner_id, appointment_duration_min" is the same.

I did it using pandas dataframe,

df['revenues_from_appointment'].fillna(df.groupby(['patient_id','practitioner_id','appointment_duration_min'])['revenues_from_appointment'].transform('mean'), inplace = True)

How can we obtain the same result by using SQL?

Final Output

appointment_date patient_id practitioner_id appointment_duration_min revenues_from_appointment
2021-06-28 42734 748 30 90.0
2021-06-29 42737 747 60 150.0
2021-07-01 42737 747 60 150.0
2021-07-03 42736 748 30 60.0
2021-07-03 42735 747 15 42.62
2021-07-04 42734 748 30 95.0
2021-07-05 42734 748 30 100.0
2021-07-10 42738 747 15 50.72
2021-08-12 42739 748 30 73.43

Solution

  • You can use the AVG window function, that will partition on the three column of interest and replace null values using the COALESCE function:

    SELECT appointment_date,
           patient_id,
           practitioner_id,
           appointment_duration_min,
           COALESCE(revenues_from_appointment, 
                    AVG(revenues_from_appointment) OVER(PARTITION BY patient_id, 
                                                                     practitioner_id, 
                                                                     appointment_duration_min))
    FROM tab
    

    Try it here.