I have a pandas dataframe that consists of multiple columns. For this question we only need to focus on the "Cholesterol" and "Age" column.
In this dataframe each row represents a person and they all have an age. The Cholesterol column has a lot of NaN values and I'm trying to fill them in by taking the mean Cholesterol for their age
E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a 37 year old person
Even better probably would be getting the mean value of their age range.
E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a person between the age of 30-40
As I do not have a lot of values for each specific age
You can use:
mean = df.groupby(df['Age'] // 10)['Cholesterol'].transform('mean')
df['Cholesterol2'] = df['Cholesterol'].fillna(mean)
Output:
>>> df.head(20)
Age Cholesterol Cholesterol2
0 47 1.623586 1.623586
1 17 7.035887 7.035887
2 22 NaN 4.901557 # <- mean of group 2 (20-29)
3 27 5.033214 5.033214
4 33 8.508555 8.508555
5 19 1.284305 1.284305
6 57 9.915908 9.915908
7 45 NaN 5.595905 # <- mean of group 4 (40-49)
8 30 4.665067 4.665067
9 65 7.578977 7.578977
10 15 1.044828 1.044828
11 53 3.775885 3.775885
12 30 6.010004 6.010004
13 70 9.855772 9.855772
14 28 2.715962 2.715962
15 56 4.552964 4.552964
16 43 5.128147 5.128147
17 81 4.848699 4.848699
18 19 5.480054 5.480054
19 22 3.049696 3.049696
Note: I created a second column to demonstration purpose only. Of course, you can override the existing column.
If you want to have more control on groups, you can use pd.cut
.
import pandas as pd
import numpy as np
rng = np.random.default_rng(2023)
df = pd.DataFrame({'Age': rng.integers(10, 90, 100),
'Cholesterol': rng.uniform(1, 10, 100)})
df.loc[rng.choice(df.index, 10), 'Cholesterol'] = np.nan