Search code examples
pythonpandasdataframefillna

Replace NaN values in a Dataframe with average of same column with the same value from another column


I have a pandas dataframe that consists of multiple columns. For this question we only need to focus on the "Cholesterol" and "Age" column.

In this dataframe each row represents a person and they all have an age. The Cholesterol column has a lot of NaN values and I'm trying to fill them in by taking the mean Cholesterol for their age

E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a 37 year old person

Even better probably would be getting the mean value of their age range.

E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a person between the age of 30-40

As I do not have a lot of values for each specific age


Solution

  • You can use:

    mean = df.groupby(df['Age'] // 10)['Cholesterol'].transform('mean')
    df['Cholesterol2'] = df['Cholesterol'].fillna(mean)
    

    Output:

    >>> df.head(20)
        Age  Cholesterol  Cholesterol2
    0    47     1.623586      1.623586
    1    17     7.035887      7.035887
    2    22          NaN      4.901557  # <- mean of group 2 (20-29)
    3    27     5.033214      5.033214
    4    33     8.508555      8.508555
    5    19     1.284305      1.284305
    6    57     9.915908      9.915908
    7    45          NaN      5.595905  # <- mean of group 4 (40-49)
    8    30     4.665067      4.665067
    9    65     7.578977      7.578977
    10   15     1.044828      1.044828
    11   53     3.775885      3.775885
    12   30     6.010004      6.010004
    13   70     9.855772      9.855772
    14   28     2.715962      2.715962
    15   56     4.552964      4.552964
    16   43     5.128147      5.128147
    17   81     4.848699      4.848699
    18   19     5.480054      5.480054
    19   22     3.049696      3.049696
    

    Note: I created a second column to demonstration purpose only. Of course, you can override the existing column.

    If you want to have more control on groups, you can use pd.cut.

    Minimal Reproducible Example:

    import pandas as pd
    import numpy as np
    
    rng = np.random.default_rng(2023)
    df = pd.DataFrame({'Age': rng.integers(10, 90, 100),
                       'Cholesterol': rng.uniform(1, 10, 100)})
    df.loc[rng.choice(df.index, 10), 'Cholesterol'] = np.nan